<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Bird-Species Audio Identification, Ensembling 1D + 2D Signals</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gyanendra</forename><surname>Das</surname></persName>
							<affiliation key="aff0">
<orgName type="institution">Indian Institute of Technology</orgName>
								<address>
									<settlement>Dhanbad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Saksham</forename><surname>Aggarwal</surname></persName>
							<email>sakshamaggarwal20@gmail.com</email>
							<affiliation key="aff0">
<orgName type="institution">Indian Institute of Technology</orgName>
								<address>
									<settlement>Dhanbad</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Bird-Species Audio Identification, Ensembling 1D + 2D Signals</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">734DDD9029637EC30F354F55E92CF283</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Deep Learning</term>
					<term>Bird Species Classification</term>
					<term>Transfer Learning</term>
					<term>Attention Mechanism</term>
					<term>Sound Detection</term>
					<term>Audio Source Detection</term>
					<term>Demucs</term>
					<term>Resnet 50</term>
					<term>Efficient Net</term>
					<term>Ensembling</term>
					<term>Multi Domain Meta Training</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we describe a method for recognizing bird species in audio recordings. We experimented with four different approaches. The model on the spectrogram and waveform domains consists of two main models: 1) a binary classifier that predicts whether a bird call is present in the audio; 2) a multiclass classifier that predicts which bird is present. Combining these two approaches, 1D and 2D signals, gives strong results. We also experiment with ATDemucs, which extends Demucs by replacing the BiLSTM with self-attention. In this approach, we first perform source separation of the multiple birds along with noise separation, as in universal source separation. We then classify each separated source using both a 1D waveform model (ReSE-Multi with self-attention) and a 2D spectrogram model. We also discuss how we handle the different thresholds of different models with a post-processing technique. Ensembling techniques such as voting, scaling, and direct averaging gave a good boost to our results. Our combined architecture of 1D and 2D signals achieves a micro-averaged F1 of 0.6179 on the task of classifying 397 bird species.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>There are about 10,000 different bird species in the world, and they all play an important role in the natural world. They serve as good indicators of declining habitat quality and pollution, and it is often easier to hear birds than it is to see them. BirdCLEF 2021 <ref type="bibr" target="#b0">[1]</ref> - Birdcall Identification is a Kaggle competition organized by the Cornell Lab of Ornithology in collaboration with LifeCLEF 2021 <ref type="bibr" target="#b0">[1]</ref>, whose challenge is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. This paper first gives details of the competition and the provided data, so that the challenges posed by the train and test data are clear. We then describe in detail the approaches we used for this challenge, including data preparation, augmentations, model building, training procedure, and post-processing techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data</head><p>This section gives a brief overview of the data provided in the competition. Training on the data posed a lot of challenges since the train and test data were of different types.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Training Data</head><p>The training data mainly comprises two types of audio recordings:</p><p>Train short audio: The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xeno-canto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format. Information on 397 unique species is given. Along with the audio files, metadata is provided, consisting of primary label, secondary labels, type, latitude, longitude, scientific name, common name, author, date, filename, license, rating, time, and URL.</p><p>Train soundscapes: There is a distinct shift in acoustic domains between the training and test sets, so some examples of soundscape recordings from the test set were provided for training and validation purposes. These 20 recordings represent 2 of the 4 test recording locations and are 10 minutes long each. The metadata specifies which birds are present in each 5-second timestamp of the training soundscapes; the nocall label is assigned if no bird is present.</p><p>All labels for the train short audio had to be treated as weak labels: we knew which species were audible in a recording, but we did not know the exact timestamps of the vocalizations. Training with weakly labeled data was one of the core challenges of this competition. The secondary labels list the audible background species as annotated by the author; these lists might be incomplete and are not very reliable. The training data also had a long-tail distribution, making the dataset highly imbalanced: head classes contained species with more than 500 training recordings, whereas some classes in the tail had a mere 10-20.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Test Data</head><p>The test set has approximately 80 audio recordings similar to the train soundscapes, each 10 minutes long. We need to identify the birds present in each 5-second timestamp throughout the audio. These recordings come from 4 locations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Our Approach</head><p>We used 4 different approaches to train our models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Model on Spectrograms</head><p>In this approach, we trained the model on Mel-spectrograms. We trained two types of models: Model A and Model B. Model A was trained to predict whether a bird is present in an audio clip or not, i.e., it was a binary classification model. To train it, we used the external Freefield1010 <ref type="bibr" target="#b1">[2]</ref> dataset along with the competition data. Model B was trained to classify the bird species. Official competition data <ref type="bibr" target="#b2">[3]</ref> was used for this model, and we tried not to feed it any nocall case, making use of the weak labels generated by Model A when run on the competition dataset (Figure <ref type="figure" target="#fig_0">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Data Preparation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Model Building</head><p>We used transfer learning from state-of-the-art ImageNet models to sound classification.</p><p>For Model type A we took 3 pretrained models: EfficientNet-B0 <ref type="bibr" target="#b3">[4]</ref>, ResNet50 <ref type="bibr" target="#b4">[5]</ref> and DenseNet <ref type="bibr" target="#b5">[6]</ref>. We noticed that SpecAugment <ref type="bibr" target="#b6">[7]</ref> was not giving good results here, but SpecChannelShuffle increased model performance by 0.07. We obtained the highest F1 score, 0.91, with EfficientNet-B0, and by blending the three models we reached a 0.93 F1 score on the Freefield1010 <ref type="bibr" target="#b1">[2]</ref> dataset for classifying whether any bird is present.</p><p>For Model type B we experimented with many pretrained models, including EfficientNet B0, B1, B2, B3, B4, ResNet 50, NFNet <ref type="bibr" target="#b7">[8]</ref> and ResNet WSL <ref type="bibr" target="#b8">[9]</ref>. We report these results in the results section (Table <ref type="table">1</ref>). Here SpecAugment worked very well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Augmentation</head><p>We applied data augmentation during the training stage.</p><p>• SpecAugment: SpecAugment is a popular augmentation technique applied to spectrograms.</p><p>The spectrogram is transformed by warping it in the time direction, masking blocks of consecutive frequency channels, and masking blocks of time steps. We noticed that SpecAugment increased model performance without requiring any further model or training-parameter tweaks.</p><p>-TimeMasking: In time masking, t consecutive time steps [𝑡 0 , 𝑡 0 + t) are masked, where t is chosen from a uniform distribution from 0 to the time mask parameter T, and 𝑡 0 is chosen from [0, 𝜏 − t), where 𝜏 is the number of time steps.</p><p>-FrequencyMasking: In frequency masking, frequency channels [𝑓 0 , 𝑓 0 + f) are masked, where f is chosen from a uniform distribution from 0 to the frequency mask parameter F, and 𝑓 0 is chosen from [0, 𝜐 − f), where 𝜐 is the number of frequency channels.</p><p>• SpecChannelShuffle: Shuffle the channels of a multichannel spectrogram (channels last). This can help combat positional bias. • MixUp <ref type="bibr" target="#b9">[10]</ref>: We applied mixup according to primary labels; that is, we combined the mel-spectrograms according to a parameter alpha drawn from a beta distribution and took the weighted average of the target labels with the same alpha. Mixup helps reduce memorization of corrupt labels and acts as a good regularizer during training.</p><formula xml:id="formula_0">𝐼𝑚𝑎𝑔𝑒 𝑖 = 𝛼 * 𝐼𝑚𝑎𝑔𝑒 𝑖 + (1 − 𝛼) * 𝐼𝑚𝑎𝑔𝑒 𝑗 𝑇 𝑎𝑟𝑔𝑒𝑡 𝑖 = 𝛼 * 𝑇 𝑎𝑟𝑔𝑒𝑡 𝑖 + (1 − 𝛼) * 𝑇 𝑎𝑟𝑔𝑒𝑡 𝑗</formula><p>Here Image represents the raw input image array and Target represents the label (one-hot encoding) of the corresponding image.</p></div>
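The mixup and time-masking operations described above can be sketched in a few lines of NumPy. The Beta parameter and mask size below are illustrative assumptions, not values stated in the paper.

```python
import numpy as np

def mixup(image_i, image_j, target_i, target_j, beta_param=0.4, rng=None):
    """Blend two mel-spectrogram images and their one-hot targets with a
    shared alpha drawn from a Beta distribution (beta_param is assumed)."""
    rng = rng or np.random.default_rng()
    alpha = rng.beta(beta_param, beta_param)
    return (alpha * image_i + (1 - alpha) * image_j,
            alpha * target_i + (1 - alpha) * target_j)

def time_mask(spec, T=30, rng=None):
    """Zero out t consecutive time steps, with t ~ Uniform(0, T)
    and the start t0 chosen uniformly from the valid range."""
    rng = rng or np.random.default_rng()
    tau = spec.shape[1]
    t = int(rng.integers(0, T + 1))
    t0 = int(rng.integers(0, max(1, tau - t)))
    out = spec.copy()
    out[:, t0:t0 + t] = 0.0
    return out
```

Frequency masking is identical with the roles of the two axes swapped; because mixup averages the one-hot targets, the mixed target still sums to 1.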
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Training Procedure</head><p>The training procedure used for the two models is as follows:</p><p>Model A: The model was fed both Freefield1010 and competition data, with the above augmentations applied. Smaller models were trained for 15 epochs and larger models for 8 epochs. We linearly increased the learning rate for the first few epochs as warmup; after reaching its peak of 0.002, it was linearly reduced. The Adam <ref type="bibr" target="#b10">[11]</ref> optimizer gave the best results for this model.</p><p>Model B: The model was fed competition data only, with augmentations similar to those of Model A. Smaller models were trained for 40 epochs and larger models for 25 epochs. A learning-rate schedule similar to Model A's was used, again with the Adam optimizer. While training, we froze all but the last few layers for the initial epochs to help the model converge faster; then all layers were unfrozen and trained for the remaining epochs.</p></div>
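The warmup-then-linear-decay schedule used for both models can be sketched as follows; only the 0.002 peak comes from the text, while the step counts in the usage below are hypothetical.

```python
def warmup_linear_decay(step, total_steps, warmup_steps, peak_lr=0.002):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For example, with 10 warmup steps out of 100, the rate reaches 0.002 at step 9 and decays to half the peak at step 55.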
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Model on Waveform Domain</head><p>In this approach, we trained the model on raw audio samples in the waveform domain. Here we also train two types of models, Model A and Model B (Figure <ref type="figure" target="#fig_0">1</ref>). Model A was trained to predict whether a bird is present in an audio clip or not, i.e., it was a binary classification model. Model B was trained to classify the bird species.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Data Preparation</head><p>We resampled the raw wave to a 16000 Hz sampling rate. Let 𝑚𝑎𝑥 𝑙 be the maximum length of the audio. If the length was less than 𝑚𝑎𝑥 𝑙 , we padded it with zeros at one end, whereas if the length was greater than 𝑚𝑎𝑥 𝑙 , we cropped the audio from both sides.</p></div>
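The pad-or-crop step above can be sketched as a small NumPy helper (a minimal sketch, assuming a 1-D float array after resampling):

```python
import numpy as np

def fix_length(wave, max_l):
    """Zero-pad at one end if shorter than max_l;
    crop equally from both sides if longer."""
    n = len(wave)
    if n < max_l:
        return np.concatenate([wave, np.zeros(max_l - n, dtype=wave.dtype)])
    start = (n - max_l) // 2
    return wave[start:start + max_l]
```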
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Model Building</head><p>This model is highly motivated by ReSE-2-Multi <ref type="bibr" target="#b11">[12]</ref>. With frame-level raw waveform input, the bottom-layer filters must learn all conceivable phase variations of (pseudo-)periodic waveforms that are likely to be present in audio signals. This has hampered the use of raw waveforms as input compared to spectrogram-based representations, in which the phase fluctuation within a frame (i.e., the time shift of periodic waveforms) is removed by taking merely the magnitude. So we added an attention layer between two fully connected (FC) layers (Figure <ref type="figure" target="#fig_1">2</ref>). It is a simple Convolutional Long Short-Term Memory Deep Neural Network (CLDNN) model <ref type="bibr" target="#b12">[13]</ref> with residual connections, which helps capture high-level features of the audio data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Augmentation</head><p>• AddImpulseResponse: Convolve the audio with a random impulse response.</p><p>• TimeMask: Make a randomly chosen part of the audio silent.</p><p>• AddGaussianSNR: Add Gaussian noise to the samples with a random signal-to-noise ratio (SNR) <ref type="bibr" target="#b13">[14]</ref>.</p><p>• AddGaussianNoise: Add Gaussian noise to the samples.</p><p>• We add pink noise at variable volumes, as well as random soundscapes.</p><p>• We also used a Butterworth filter with stochastic cutoffs (randomly lowpass, highpass, bandpass, bandstop).</p></div>
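As an example, the AddGaussianSNR augmentation from the list above can be sketched in NumPy; the SNR range is a hypothetical choice, since the paper does not give exact values.

```python
import numpy as np

def add_gaussian_snr(wave, min_snr_db=5.0, max_snr_db=30.0, rng=None):
    """Add Gaussian noise at a random signal-to-noise ratio (in dB).
    Noise power is set relative to the measured signal power."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(min_snr_db, max_snr_db)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise
```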
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Training Procedure</head><p>Model A used both Freefield1010 data and competition data for training, whereas Model B used competition data only. The augmentations stated above were applied to the raw audio samples. The rest of the training procedure is very similar to that of the Wave-gram model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Multi-Domain Meta Training</head><p>After training on the whole dataset in the spectrogram domain and the waveform domain, we tested our hypothesis of combining the results from both domains so that each model gains the other model's domain knowledge. For training, we froze all but the last 5 layers of the Wave-gram training model 𝑀 𝑔 , and all but the last 3 layers of the waveform-domain training model 𝑀 𝑓 . We calculated the loss as below, which back-propagates through both models.</p><p>𝑂 𝑔𝑖 = 𝑀 𝑔 (𝑆𝑝𝑒𝑐(𝑋 𝑖 ))</p><formula xml:id="formula_1">𝑂 𝑓 𝑖 = 𝑀 𝑓 (𝑋 𝑖 ) 𝐿𝑜𝑠𝑠 𝑖 = 𝐶𝑟𝑖𝑡𝑒𝑟𝑖𝑜𝑛(𝑂 𝑔𝑖 , 𝑇 𝑖 ) + 𝐶𝑟𝑖𝑡𝑒𝑟𝑖𝑜𝑛(𝑂 𝑓 𝑖 , 𝑇 𝑖 )</formula><p>We got a boost of 0.05 in Cross Validation Score with this technique.</p></div>
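One training step under the combined loss above can be sketched in PyTorch. This is a minimal sketch: `spec_fn` stands in for the waveform-to-spectrogram transform, the layer freezing is assumed to have been done beforehand, and the model shapes in the usage are placeholders.

```python
import torch
import torch.nn as nn

def meta_training_step(model_g, model_f, spec_fn, x, target,
                       criterion, optimizer):
    """Summed loss back-propagates through both the spectrogram model M_g
    and the waveform model M_f, as in the equations above."""
    optimizer.zero_grad()
    out_g = model_g(spec_fn(x))   # O_gi = M_g(Spec(X_i))
    out_f = model_f(x)            # O_fi = M_f(X_i)
    loss = criterion(out_g, target) + criterion(out_f, target)
    loss.backward()               # gradients flow into both models
    optimizer.step()
    return loss.item()
```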
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">ATDemucs</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Motivation</head><p>In the test set and train soundscapes, an audio file contains different types of birds. We thought of separating them and then training the classification models, introducing the music source separation concept into the multi-class classification task. The model is highly motivated by Demucs <ref type="bibr" target="#b14">[15]</ref>. We provide the code in our GitHub repository.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Data Preparation</head><p>We discovered that an audio sample in the Train Soundscapes data typically contained a maximum of 5 birds, so we took a hyper-parameter 𝑆𝑒𝑝 𝑁 𝑜 and mixed 𝑆𝑒𝑝 𝑁 𝑜 short audios of birds. In another experiment, we mixed in the nocall data from Freefield1010 and considered nocall as another source that needs to be separated. We followed the same data-preparation steps as for the waveform-domain dataset. We took 𝑆𝑒𝑝 𝑁 𝑜 different short audio recordings 𝐴 𝑖 and mixed them according to ∑︀ 𝑆𝑒𝑝 𝑁 𝑜 𝑖=1 𝐴 𝑖 . For the second-stage training of this model, we prepared the train soundscapes data by dividing it into chunks of length 𝑚𝑎𝑥 𝑙 and trained with the pseudo labels predicted by the first-stage model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Model Building</head><p>What is the difference between Demucs and ATDemucs? In Demucs there is a downsample block, then a BiLSTM <ref type="bibr" target="#b15">[16]</ref> layer, then an upsample block. In ATDemucs (Figure <ref type="figure" target="#fig_3">3</ref>) there is attention in place of the LSTM layer and in the upsample block. In our method, we apply cross attention between the downsample output and the upsample output. Downsample Block: The downsample block is made up of a convolution with kernel size K=8, stride S=4, 𝐶 𝑖−1 input channels, 𝐶 𝑖 output channels, and ReLU activation, followed by a 1x1 convolution with GLU <ref type="bibr" target="#b16">[17]</ref> activation. We doubled the number of channels in the 1x1 convolution, since the GLU outputs C/2 channels given C input channels. Horizontal Trans Block: We replace the Bi-LSTM layer with a Self-Attention <ref type="bibr" target="#b17">[18]</ref> layer consisting of 8 heads, dropout 0.2, and hidden size 𝐶 𝐿 . This block outputs 2𝐶 𝐿 channels per time position; we use a 1x1 convolution with ReLU activation to take that number down to 𝐶 𝐿 . Upsample Block: The upsample block is nearly symmetrical to the downsample block. It is made up of a convolution with kernel size 3 and stride 1, input/output channels 𝐶 𝑖 , and ReLU <ref type="bibr" target="#b18">[19]</ref> activation. Instead of the simple concatenation used in Demucs, we introduce a cross-attention layer in which we take the query from the downsample block and the key and value from the upsample block. We then return to 𝐶 𝑖 channels with a 1x1 convolution using GLU activation. Finally, we employ a transposed convolution with kernel size K = 8, stride S = 4, 𝐶 𝑖−1 outputs, and ReLU activation. For the final layer, we output 4𝐶 0 channels without an activation function.</p></div>
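The cross-attention skip connection described above (query from the downsample path, key and value from the upsample path) can be sketched with PyTorch's built-in multi-head attention. The 8 heads and dropout 0.2 follow the text; the module name, projection details, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    """Sketch of ATDemucs' replacement for Demucs' skip concatenation:
    the downsample-block output provides the query, the upsample-block
    output provides the key and value."""
    def __init__(self, channels, heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, dropout=dropout,
                                          batch_first=True)

    def forward(self, down_feat, up_feat):
        # (batch, channels, time) -> (batch, time, channels) for attention
        q = down_feat.transpose(1, 2)
        kv = up_feat.transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2)
```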
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">Augmentation</head><p>• Shift: Randomly shift audio in time by up to 'shift' samples.</p><p>• FlipChannels: Flip left-right channels.</p><p>• FlipSign: Random sign flip.</p><p>• Remix: Within a batch, shuffle the sources. Each batch is divided into groups of size group size, and shuffling is done separately inside each group.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.5.">Training Procedure</head><p>We trained this model in two stages.</p><p>First Stage: First we trained the model on mixed short audio. Here the training input is a combination of 5 different birds' short audios, and we trained our model to differentiate between these different recordings and separate them. We trained the model for 150 epochs with a learning rate of 0.003. We used cosine annealing as the LR scheduler, which starts with a large learning rate that is relatively rapidly decreased to a minimum value before being rapidly increased again. The AdamW <ref type="bibr" target="#b19">[20]</ref> optimizer gave better results than the others.</p><p>Second Stage: In the train soundscapes, we were given primary labels for the audio recordings at each 5-second timestamp. So, after the first stage, we ran inference on the train soundscapes and used pseudo-labeling to fine-tune the model. We trained the model for 5 epochs with a low learning rate, again with AdamW as the optimizer. During training, we froze some of the initial layers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.6.">Classification After Separation</head><p>Once our model has been trained to separate the different bird sounds from the main audio recording, we run a classification model on the separated audios to identify the bird species. For this, we used a ResNet50 model with pre-trained weights, trained for approximately 20 epochs with the Adam optimizer. We got a Cross Validation Score of 0.62.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Pipeline Of Spectrograms And WaveForm Domain Model Training.</figDesc><graphic coords="5,175.95,84.19,240.90,157.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: ReSE-2-Multi With Attention for WaveForm Domain Model Training</figDesc><graphic coords="6,131.40,84.17,330.00,358.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄 𝐷 , 𝐾 𝑈 , 𝑉 𝑈 ) = 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥(𝑄 𝐷 𝐾 𝑇 𝑈 / √︀ 𝑑 𝑘 )𝑉 𝑈 Where 𝑄 𝐷 is the corresponding downsample layer's value (the query), and 𝐾 𝑈 and 𝑉 𝑈 are the upsample layer's values (the key and value).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: ATDemucs (Attention + Demucs). It consists of three types of blocks: Downblock, HorizontalTransBlock and UpTransBlock. As the name suggests, we use attention in HorizontalTransBlock and UpTransBlock.</figDesc><graphic coords="8,135.52,238.20,321.75,159.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>• Resample the dataset to a 22050 Hz sampling rate. • Let 𝑡𝑖𝑚𝑒 𝑑 be the accepted minimum duration of an audio sample. We choose a random 𝑡𝑖𝑚𝑒 𝑑 -length chunk from the audio sample. • Let 𝑚𝑖𝑛 𝑠 be the accepted minimum duration of the subimage. If the duration is less than 𝑚𝑖𝑛 𝑠 , we pad it back to length 𝑚𝑖𝑛 𝑠 . • Compute three Mel-Spectrograms 𝑀 𝑖 (𝑥) with window sizes 𝑊 𝑖 ∈ (128, 512, 2048). • Concatenate the three 𝑀 𝑖 (𝑥) into one 3-channel RGB multiscale image I.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Github Repo https://github.com/Luckygyana/Bird-Species-Audio-Identification-Ensembling-and-1D-2D-Signals</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Post Processing</head><p>We used two post-processing techniques. Scaling Method: We noticed that different models have different best thresholds, so we decided to bring them onto a common scale before adding the logits. Let 𝑀 𝑖𝑛 𝑇 ℎ be the minimum of the best thresholds of all the models to be ensembled. We scaled the logits such that each model's predictions below its best threshold are mapped into the range 0 to 𝑀 𝑖𝑛 𝑇 ℎ , whereas its predictions above its best threshold are mapped into the range 𝑀 𝑖𝑛 𝑇 ℎ to 1. We then average all the logits thus obtained and predict all birds with probability above 𝑀 𝑖𝑛 𝑇 ℎ .</p><p>Voting Ensemble: Let 𝑀 𝑖𝑛 𝐶 be the minimum number of the N models in which a bird must be predicted. We predict all birds for which ⋂︀ 𝑁 𝑖=1 𝑀 𝑜𝑑𝑒𝑙 𝑖 &gt; 𝑀 𝑖𝑛 𝐶 .</p><p>We submitted three types of inference models:</p><p>• Spectrograms Model + Waveform Model: We ensemble all the models with the above scaling method, which gave us a Cross Validation Score of 0.732 and a LeaderBoard Score of 0.6179. • Multi-Domain Meta Trained Model: We optimize the best threshold on CV and get a Cross Validation Score of 0.705 and a LeaderBoard Score of 0.6167 with a 0.15 threshold. • ATDemucs: We get a Cross Validation Score of 0.623 and a LeaderBoard Score of 0.59. There are many avenues for further increasing the model accuracy.</p></div>
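The scaling method above amounts to a piecewise-linear remapping of each model's probabilities so that every model's own best threshold lands on the shared minimum threshold. A minimal sketch, with function names and the example thresholds being assumptions:

```python
import numpy as np

def scale_probs(probs, best_th, min_th):
    """Map [0, best_th) onto [0, min_th) and [best_th, 1] onto [min_th, 1],
    so all models agree on a single decision threshold min_th."""
    probs = np.asarray(probs, dtype=float)
    out = np.empty_like(probs)
    below = probs < best_th
    out[below] = probs[below] / best_th * min_th
    out[~below] = min_th + (probs[~below] - best_th) / (1 - best_th) * (1 - min_th)
    return out

def scaled_ensemble(model_probs, best_ths):
    """Scale each model's probabilities, average, then threshold at min_th."""
    min_th = min(best_ths)
    scaled = [scale_probs(p, t, min_th) for p, t in zip(model_probs, best_ths)]
    return np.mean(scaled, axis=0) > min_th
```

For instance, with per-bird probabilities [0.9, 0.1] from a model whose best threshold is 0.5 and [0.6, 0.2] from one whose best threshold is 0.3, only the first bird ends up above the shared 0.3 threshold.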
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Results</head><p>Table <ref type="table">1</ref> shows the Cross Validation Score of the spectrogram-based models (Model type B). After scaling all the models, we ensembled them with a threshold of 0.20 and got 0.716 accuracy. The ensemble achieved a Cross Validation Score of 0.732 and a LeaderBoard Score of 0.6179. We are still working out a good ensembling method, other than averaging, for combining all 3 methods along with ATDemucs. We will update all our key findings in the source code 1 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Conclusion and future work</head><p>We combined several approaches, specifically a spectrogram architecture, a raw-waveform architecture, and multi-domain meta training. In both the spectrogram model and the raw-waveform model, we used two downstream modules: one for predicting whether a bird is present and the other for multi-label classification of the birds. We then combined both approaches using a loss that back-propagates through both models. We also experimented with the Demucs model and extended its architecture by adding an attention layer in the upsampling block. Ensembling methods, including voting and scaling, achieved better results than any individual model. The spectrogram model, along with scaling and the downstream modules, gave us the best result on the Private Leaderboard and helped us reach 67th position in the competition.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of lifeclef 2021: a system-oriented evaluation of automated species identification and species distribution prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lorieul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ruiz De Castañeda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bolon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dorso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Eggel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bonnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twelfth International Conference of the CLEF Association</title>
				<meeting>the Twelfth International Conference of the CLEF Association<address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">freefield1010 -an open dataset for research on audio field recording archives</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stowell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Plumbley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Audio Engineering Society 53rd Conference on Semantic Audio (AES53)</title>
				<meeting>the Audio Engineering Society 53rd Conference on Semantic Audio (AES53)</meeting>
		<imprint>
			<publisher>Audio Engineering</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of birdclef 2021: Bird call identification in soundscape recordings</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2021 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Efficientnet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.11946</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1512.03385</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1608.06993</idno>
		<title level="m">Densely connected convolutional networks</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">SpecAugment: A simple data augmentation method for automatic speech recognition</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="DOI">10.21437/interspeech.2019-2680</idno>
		<ptr target="http://dx.doi.org/10.21437/Interspeech.2019-2680" />
	</analytic>
	<monogr>
		<title level="j">Interspeech</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Brock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>De</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2102.06171</idno>
		<title level="m">High-performance large-scale image recognition without normalization</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Exploring the limits of weakly supervised pretraining</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Mahajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ramanathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Paluri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bharambe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1710.09412</idno>
		<title level="m">mixup: Beyond empirical risk minimization</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<title level="m">Adam: A method for stochastic optimization</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nam</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1712.00866</idno>
		<title level="m">Raw waveform-based audio classification using sample-level CNN architectures</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Convolutional, long short-term memory, fully connected deep neural networks</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Sainath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Senior</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sak</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICASSP.2015.7178838</idno>
	</analytic>
	<monogr>
		<title level="m">2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="4580" to="4584" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Signal-to-noise ratio (SNR) as a measure of reproducibility: Design, estimation, and application</title>
		<author>
			<persName><forename type="first">N</forename><surname>Elkum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shoukri</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10742-008-0030-2</idno>
	</analytic>
	<monogr>
		<title level="j">Health Services and Outcomes Research Methodology</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="119" to="133" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Défossez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usunier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bach</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.01174</idno>
		<title level="m">Demucs: Deep extractor for music sources with extra unlabeled data remixed</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Bidirectional LSTM-CRF models for sequence tagging</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.01991</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Language modeling with gated convolutional networks</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grangier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1612.08083</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.03762</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Agarap</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.08375</idno>
		<title level="m">Deep learning using rectified linear units (ReLU)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.05101</idno>
		<title level="m">Decoupled weight decay regularization</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
