=Paper=
{{Paper
|id=Vol-3740/paper-204
|storemode=property
|title=Bird-Species Audio Identification, Ensembling of EfficientNet-B0 and Pre-trained EfficientNet-B1 model
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-204.pdf
|volume=Vol-3740
|authors=Aaditya Porwal
|dblpUrl=https://dblp.org/rec/conf/clef/Porwal24
}}
==Bird-Species Audio Identification, Ensembling of EfficientNet-B0 and Pre-trained EfficientNet-B1 model==
Bird-Species Audio Identification, Ensembling of
EfficientNet-B0 and Pre-trained EfficientNet-B1 model
Notebook for the Lab at CLEF 2024
Aaditya Porwal
Indian Institute of Technology, Dhanbad, India
Abstract
In this study, I present a novel approach to audio classification, specifically for the BirdCLEF 2024 challenge, by
employing an ensemble of EfficientNet models. My methodology integrates EfficientNet-B0, trained exclusively on
the current competition’s data, and EfficientNet-B1, pre-trained on datasets from previous BirdCLEF competitions.
The EfficientNet-B0 model leverages heavy augmentation techniques to enhance generalization and robustness.
Data preprocessing involves transforming audio signals into Mel spectrograms, optimized through feature
engineering and augmentation methods. The ensemble strategy, combining predictions from both models,
achieves superior performance compared to individual models. My results demonstrate the efficacy of this
approach, with significant improvements in classification accuracy and robustness, exemplified by achieving the
25th rank out of 975 competitors on the BirdCLEF 2024 leaderboard.
Keywords
Deep Learning, Bird Species Classification, Transfer Learning, Attention Mechanism, Sound Detection, Audio
Source Detection, EfficientNet, Ensembling, Audio Classification, BirdCLEF, Ensemble Learning, Data Augmentation,
Mel Spectrogram, Convolutional Neural Network, Feature Engineering, ROC-AUC
1. Introduction
There are about 10,000 different bird species in this world, and they all play an important role in the
natural world. Birds are excellent indicators of biodiversity change since they are highly mobile and
have diverse habitat requirements. BirdCLEF 2024 [1] is a Kaggle competition organized by The Cornell
Lab of Ornithology in collaboration with LifeCLEF 2024 [2], whose challenge is to identify which
birds are calling in long recordings, given training data generated in meaningfully different contexts.
The BirdCLEF 2024 competition focuses on identifying bird calls in long recordings, particularly from
the sky-islands of the Western Ghats. This competition presents significant challenges, including
imbalanced training data per species, domain shifts between training and test data, and a strict two-hour
time limit for analyzing extensive recordings.
This paper first describes the competition and its data, to clarify the challenges posed by the train
and test sets. It then details my solution: data preparation, augmentation, model building, training
procedures, and post-processing, followed by the results and a conclusion. If successful, this
effort will support ongoing initiatives to protect avian biodiversity in the Western Ghats, India.
2. Data
2.1. Training Data
• Train audio: The bulk of the training data consists of short recordings of individual bird calls
from xeno-canto.org. These files have been downsampled to 32 kHz where applicable to match
the test set audio and converted to the ogg format. Information of 182 unique species has been
given.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
Email: aadityaporwal234@gmail.com (A. Porwal)
https://github.com/AADI-234 (A. Porwal)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
• Train metadata: Along with audio files, metadata is also provided which consists of primary
label, secondary labels, type, latitude, longitude, scientific name, common name, author, filename,
license, rating, and URL.
• Unlabeled soundscapes: Unlabeled audio from the same locations as the test soundscapes is
provided.
• eBird_Taxonomy_v2021.csv: Contains data on species relationships.
2.2. Test Data
Test_soundscapes: The test_soundscapes directory will be populated with approximately 1,100 recordings
to be used for scoring. They are 4 minutes long and in ogg audio format.
3. My Approach
My final approach to this dataset was to ensemble two different EfficientNets (B0 and B1) [3]. The B1
model [4] was conditioned on data from previous years' BirdCLEF competitions during its pretraining
stage, while the B0 model [5] was trained using only this competition’s data.
4. EfficientNet-B0 using heavy augmentation
4.1. Overview
The task of audio classification involves identifying and categorizing audio signals into predefined
classes. In this project, I employed EfficientNet-B0, a highly efficient convolutional neural network
[6], to classify audio signals. EfficientNet-B0 is chosen because of its balance of performance and
computational efficiency.
The audio data is preprocessed and transformed into Mel spectrograms, which are fed into the
EfficientNet-B0 model. The model is trained using a variety of data augmentation techniques to enhance
generalization and robustness. The classification performance is optimized through advanced feature
engineering, cross-validation, and careful selection of hyperparameters.
4.2. Data Preparation
The audio data used in this project is sampled at 32,000 samples per second (sr = 32000). To handle
the varying lengths of the audio files, every training clip is set to a fixed duration of 30 seconds.
From these 30-second clips, random 5-second segments are extracted for Short-Time Fourier Transform
(STFT) processing [7]. For testing, a uniform duration of 5 seconds is used. More details are provided
in Table 1.
Table 1
Audio data parameters and spectrogram configuration
Parameter | Value
Sample Rate | 32,000 Hz
Clip Duration | 30 sec
STFT Segment | 5 sec
Frequency Range | 20 Hz - 15,000 Hz
Mel Bands | 128
FFT Components | 1024
Spectrogram Time Axis | 512
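The parameters in Table 1 can be tied together with a little arithmetic. The following is a minimal sketch, not the exact pipeline used here: it assumes short recordings are zero-padded to 30 seconds and that the hop length is chosen so that a centred STFT of a 5-second segment yields 512 frames. The helper names `fix_length` and `random_segment` are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 32_000      # Hz, from Table 1
CLIP_SECONDS = 30         # fixed training clip length
SEGMENT_SECONDS = 5       # length of each STFT segment
TIME_FRAMES = 512         # spectrogram time axis, from Table 1

def fix_length(wave, n_samples):
    """Truncate or zero-pad (assumed padding scheme) to exactly n_samples."""
    if len(wave) >= n_samples:
        return wave[:n_samples]
    return np.pad(wave, (0, n_samples - len(wave)))

def random_segment(clip, seg_len, rng):
    """Pick a random contiguous segment of seg_len samples from the clip."""
    start = rng.integers(0, len(clip) - seg_len + 1)
    return clip[start:start + seg_len]

rng = np.random.default_rng(0)
raw = rng.standard_normal(SAMPLE_RATE * 12).astype(np.float32)  # dummy 12 s recording
clip = fix_length(raw, SAMPLE_RATE * CLIP_SECONDS)
segment = random_segment(clip, SAMPLE_RATE * SEGMENT_SECONDS, rng)

# Assumed hop length: chosen so that a centred STFT gives TIME_FRAMES frames.
hop_length = len(segment) // (TIME_FRAMES - 1)
n_frames = 1 + len(segment) // hop_length
print(hop_length, n_frames)
```

Under these assumptions a 160,000-sample segment gives a hop length of 313 samples and 512 frames, which matches the time axis in Table 1.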
4.3. Feature Engineering
• Feature engineering focuses on transforming raw audio data (acoustic signal) into a format
suitable for model training. Mel spectrograms are generated from the audio signals, converting
them into a visual representation of frequency content over time. This transformation is achieved
using Short-Time Fourier Transform (STFT) [8], where each 5-second audio slice undergoes
Fourier transformation to capture the spectral properties. An illustration of this process can be
seen in Figure 1.
• Additionally, various augmentation techniques are employed to enhance the training data.
Spectrogram-specific augmentations like masking and coarse dropout [9] are used, further
diversifying the training data. Mixup augmentation [10], both in waveform and spectrogram
forms, combines multiple examples to create synthetic training samples, enhancing the model’s
robustness and generalization capabilities.
Figure 1: The process of extracting the Mel spectrogram from an acoustic signal. Source: ResearchGate
Figure 2: Logarithmic magnitude spectrogram showing frequency content over time using STFT. Source: Kaggle
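The Mel spectrogram extraction described in this section can be sketched with NumPy alone. This is an illustrative implementation under stated assumptions, not the code used for the submission: the Hann window, the reflect padding, the triangular filter shape, the HTK-style mel formula, and the hop length of 313 samples are all assumptions, while the sample rate, FFT size, number of Mel bands, and frequency range come from Table 1.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    """Triangular filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(wave, sr=32_000, n_fft=1024, hop=313,
                        n_mels=128, fmin=20.0, fmax=15_000.0):
    """Log-power Mel spectrogram with shape (n_mels, n_frames)."""
    padded = np.pad(wave, n_fft // 2, mode="reflect")      # centre the frames
    n_frames = 1 + (len(padded) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = padded[idx] * np.hanning(n_fft)               # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # STFT power
    mel = mel_filterbank(sr, n_fft, n_mels, fmin, fmax) @ power.T
    return 10.0 * np.log10(np.maximum(mel, 1e-10))         # decibel scale

# A 5-second, 1 kHz test tone stands in for a bird call.
t = np.arange(5 * 32_000) / 32_000
tone = np.sin(2 * np.pi * 1_000.0 * t)
spec = log_mel_spectrogram(tone)
print(spec.shape)  # (128, 512)
```

In practice a library routine (for example from librosa or torchaudio) would normally replace the hand-written filterbank; the explicit version is shown only to make each step visible.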
4.4. Model Building
The model architecture is based on EfficientNet-B0. It incorporates average pooling (pool_type =
’avg’) and utilizes Binary Cross-Entropy with Focal Loss (loss_type = "BCEFocalLoss") to address class
imbalance. More details are provided in Table 2.
Table 2
EfficientNet-B0 model configurations and training parameters
Component | Configuration
Model | EfficientNet-B0
Pooling | Average Pooling
Loss Function | Binary Cross-Entropy with Focal Loss
Optimizer | adan
Learning Rate | 1.0e-03
Weight Decay | 1.0e-02
Early Stopping | Patience = 5 epochs
Mixed Precision | Automatic Mixed Precision (AMP) enabled
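To illustrate how a focal term reshapes binary cross-entropy for imbalanced classes, here is a minimal NumPy sketch of the standard focal loss formulation. The values `alpha=0.25` and `gamma=2.0` are common defaults and are assumptions here; the paper does not state the values used, and the actual BCEFocalLoss implementation may differ in detail.

```python
import numpy as np

def bce_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Mean focal-weighted binary cross-entropy over all (sample, class) pairs.

    alpha and gamma are assumed values, not taken from the paper.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)   # avoid log(0)
    # Easy examples (probs close to the target) are down-weighted by the
    # (1 - p)^gamma and p^gamma factors.
    pos = -alpha * (1.0 - probs) ** gamma * targets * np.log(probs)
    neg = -(1.0 - alpha) * probs ** gamma * (1.0 - targets) * np.log(1.0 - probs)
    return float(np.mean(pos + neg))

logits = np.array([[4.0, -4.0], [-4.0, 4.0]])
easy_targets = np.array([[1.0, 0.0], [0.0, 1.0]])   # predictions agree
hard_targets = np.array([[0.0, 1.0], [1.0, 0.0]])   # predictions disagree
easy_loss = bce_focal_loss(logits, easy_targets)
hard_loss = bce_focal_loss(logits, hard_targets)
print(easy_loss < hard_loss)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary binary cross-entropy, which is a convenient sanity check.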
4.5. Augmentation
Data augmentation plays a crucial role in improving the model’s generalization. Various augmentation
techniques are employed to introduce variability in the training data. These include:
• Waveform Augmentation:
– Random Noise Addition: This technique involves adding random noise to the audio signal
to simulate real-world audio conditions and improve the model’s robustness to background
noise.
– Gain Adjustments: Adjusting the volume levels to handle recordings with different gain
levels. This helps the model generalize better across recordings with varying loudness.
– Pitch Shifting: This technique was considered but ultimately set to 0 because it disrupted
the distinct frequency patterns critical for bird call classification.
– Time Shifting: Similarly, time shifting was considered but set to 0 as it disrupted the
temporal patterns essential for accurate classification.
• Spectrogram Augmentation:
– Masking Parts of the Spectrogram: Techniques like spectrogram masking were employed
to hide parts of the spectrogram, making the model more robust to missing data. However,
this did not show significant improvements, due to the distinct and critical nature of the
bird call frequency patterns.
– Randomly Dropping Coarse Regions: This technique, known as coarse dropout, was
used to randomly drop regions of the spectrogram. It aimed to make the model more robust,
but preliminary experiments showed limited effectiveness.
– Horizontal Flipping: Not used because flipping the time axis of a spectrogram is not
meaningful for audio data and would disrupt the temporal sequence of the sound.
• Mixup Augmentation: Both waveform and spectrogram mixup techniques were employed.
This involves linearly combining two examples to create a new synthetic example, enhancing the
model’s robustness and generalization capabilities. This was controlled by parameters such as:
– Waveform Mixup (aug_wave_mixup): Set to 1.0. This parameter indicates that waveform
mixup was applied to all audio samples during training. Waveform mixup involves
combining the waveforms of two different audio samples by taking a weighted average of
their amplitudes.
– Spectrogram Mixup (aug_spec_mixup): Set to 0.0. This parameter indicates that spectrogram
mixup was not applied to the spectrograms during training. Spectrogram mixup
would involve combining the spectrogram representations of two different audio samples in
a similar manner to waveform mixup.
– Probability of applying Spectrogram Mixup (aug_spec_mixup_prob): Set to 0.5. This
parameter defines the probability with which spectrogram mixup would be applied to an
audio sample during training. Even though spectrogram mixup was set to 0.0 in this case,
this parameter would control the likelihood of its application if it were used.
The mix ratio for combining the samples was drawn from a Beta distribution with 𝛼 = 0.95. The
Beta distribution is commonly used in mixup techniques to control the interpolation between two
samples. For a symmetric Beta distribution with 𝛼 < 1 the density is U-shaped, so mix ratios near 0 or 1
are favored and such synthetic samples are predominantly composed of one of the original samples.
With 𝛼 = 0.95 this tilt is mild and the ratios are close to uniformly distributed over [0, 1]. Ratios
near the extremes preserve the distinct features of one sample, while intermediate ratios provide
stronger interpolation between the two.
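Waveform mixup with a Beta-distributed mix ratio can be sketched as follows. This is an illustrative version that assumes a symmetric Beta(0.95, 0.95) distribution, a single mix ratio per batch, and partners chosen by a random permutation of the batch; none of these details are confirmed by the text above, and the function name `waveform_mixup` is hypothetical.

```python
import numpy as np

def waveform_mixup(waves, labels, alpha=0.95, rng=None):
    """Mix each waveform (and its label vector) with a randomly chosen partner."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # mix ratio in [0, 1]
    perm = rng.permutation(len(waves))        # partner index for each sample
    mixed_waves = lam * waves + (1.0 - lam) * waves[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_waves, mixed_labels, lam

rng = np.random.default_rng(0)
waves = rng.standard_normal((4, 160_000))     # four 5-second clips at 32 kHz
labels = np.eye(4)                            # one-hot labels, one class per clip
mixed_waves, mixed_labels, lam = waveform_mixup(waves, labels, rng=rng)
print(mixed_waves.shape, mixed_labels.sum(axis=1))
```

Because the labels are mixed with the same ratio as the audio, each mixed label vector still sums to 1 when the inputs are one-hot.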
4.6. Training Procedure
The training procedure involves cross-validation, early stopping, and augmentation strategies. The
training configurations are provided in Table 3.
Table 3
Training procedure parameters and configurations
Parameter | Configuration
Cross-Validation | 5 folds
Training Epochs | Max 9 epochs, Augmentation for first 6 epochs
Batch Size | Training: 32, Validation: 1
Oversampling | Threshold: 60
Mixup Function | Enabled
Spectral Mixup | Enabled
The mixup function combines two examples in the dataset to create a new example, enhancing
generalization. Spectral mixup further diversifies training data by combining spectrograms from
different audio samples. These strategies help the model learn robust and generalized representations,
improving classification performance.
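The oversampling entry in Table 3 is not described further. One plausible reading, shown below purely as an assumption, is that every species with fewer than 60 recordings is resampled with replacement until it reaches 60. The function name `oversample_indices` is hypothetical.

```python
import numpy as np

def oversample_indices(labels, threshold=60, rng=None):
    """Return dataset indices in which every class has at least `threshold` entries.

    Assumed interpretation of "Oversampling Threshold: 60"; rare classes are
    topped up by sampling their own recordings with replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    out = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        out.append(idx)                                   # keep all originals
        if len(idx) < threshold:
            extra = rng.choice(idx, size=threshold - len(idx), replace=True)
            out.append(extra)                             # duplicates for rare class
    return np.concatenate(out)

species = np.array(["common"] * 100 + ["rare"] * 7)
indices = oversample_indices(species, threshold=60, rng=np.random.default_rng(0))
names, counts = np.unique(species[indices], return_counts=True)
print(dict(zip(names.tolist(), counts.tolist())))
```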
5. EfficientNet-B1 with Pre-Training
5.1. Overview
The task of audio classification involves categorizing audio signals into predefined classes. In this project,
I developed a model for identifying bird calls using TensorFlow and the EfficientNet-B1 architecture,
drawing inspiration from previous work on pre-training by Awsaf (Md Awsafur Rahman) [11]. The
model was pre-trained on BirdCLEF datasets from 2021-2023 and Xeno-Canto Extend, and fine-tuned
on BirdCLEF 2024 data to enhance transfer learning. Advanced audio processing and feature extraction
techniques were employed to optimize performance on TPU devices, addressing challenges such as
spectrogram augmentation and effective transfer learning.
5.2. Data Preparation
Data Sources: BirdCLEF datasets from 2021-2024 and Xeno-Canto Extend [12] [13] [14] [15] [16] [17].
The raw audio data, stored in .ogg format, is efficiently handled using the ‘tf.data‘ API.
Each audio clip is sampled at 32,000 Hz and has a uniform duration of 10 seconds. To effectively
capture the audio features, the spectrogram parameters are carefully chosen. The frequency range spans
from 20 Hz to 16,000 Hz. The Short-Time Fourier Transform (STFT) parameters include an FFT window
size of 2028 and a spectrogram window size of 2048. These configurations ensure that the critical audio
characteristics are well-represented for model training. More details are provided in Table 4.
Table 4
Audio data parameters and spectrogram configurations for EfficientNet-B1
Parameter | Value
Sample Rate | 32,000 Hz
Clip Duration | 10 sec
Frequency Range | 20 Hz - 16,000 Hz
FFT Window Size | 2028
Spectrogram Window Size | 2048
5.3. Feature Engineering
Feature engineering focuses on transforming raw audio data into a format suitable for model training.
Mel spectrograms are generated from the audio signals, converting them into a visual representation of
frequency content over time. This transformation is achieved using the ‘MelSpectrogram‘ layer, where
each 10-second audio slice undergoes Fourier transformation to capture the spectral properties.
To improve the model’s robustness, I apply various augmentation techniques on the spectrograms:
• Time and Frequency Masking: Randomly masks parts of the spectrogram in both time and
frequency dimensions.
• Normalization: Standardizes the data using mean and standard deviation, followed by rescaling
to the [0, 1] range.
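The two spectrogram operations listed above can be sketched in NumPy. This is an illustration rather than the TensorFlow layers themselves: the maximum mask widths, the use of a single mask per axis, filling masked regions with zeros, and applying normalization before masking are all assumptions.

```python
import numpy as np

def zscore_minmax(spec, eps=1e-6):
    """Standardize with mean and standard deviation, then rescale to [0, 1]."""
    z = (spec - spec.mean()) / (spec.std() + eps)
    return (z - z.min()) / (z.max() - z.min() + eps)

def time_freq_mask(spec, max_freq=16, max_time=32, rng=None):
    """Zero out one random band of Mel bins and one random band of frames.

    max_freq and max_time are assumed mask widths, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape
    f = rng.integers(0, max_freq + 1)          # mask height in Mel bins
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time + 1)          # mask width in frames
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out

rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 256)) * 20.0 - 40.0   # dummy log-Mel values
normalized = zscore_minmax(spec)
masked = time_freq_mask(normalized, rng=rng)
print(normalized.min(), normalized.max(), masked.shape)
```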
5.4. Model Building
The model architecture is based on EfficientNet-B1, a convolutional neural network known for its
efficiency and performance. Key configurations include:
• Pretraining: Initialized with ImageNet weights to leverage transfer learning.
• Final Activation: Softmax, used for multi-class classification.
• Filter Stride Reduction (FSR): Reduces the stride in the stem block.
The model incorporates several custom layers, implemented with the TensorFlow library, to handle
specific tasks. The same layers could also be implemented in PyTorch.
• MelSpectrogram Layer: Converts audio signals into Mel spectrograms.
• TimeFreqMask Layer: Applies time and frequency masking for spectrogram augmentation.
• ZScoreMinMax Layer: Standardizes and rescales the spectrogram data.
• MixUp and CutMix Layers: Augment the training data by mixing audio samples.
5.5. Augmentation
Data augmentation plays a crucial role in improving the model’s generalization. Various augmentation
techniques are employed to introduce variability in the training data:
• Audio Augmentation:
– Gaussian Noise Addition: This technique was chosen to simulate different environmental
noise conditions and make the model robust to noise. Applied with a probability of 0.5, it
adds random noise to the audio signal, improving the model’s ability to generalize to noisy
data.
– Time Shifting: Shifting the audio signal in time to introduce variability, but set to 0 as it
disrupted the temporal patterns essential for accurate classification.
– MixUp: This technique involves mixing two audio signals to create a synthetic example,
applied with a probability of 0.65. This helps the model generalize better by providing varied
training examples.
– CutMix: Similar to MixUp, CutMix combines two audio signals but by cutting and pasting
parts of each, also applied with a probability of 0.65.
• Spectrogram Augmentation:
– Time and Frequency Masking: This technique was chosen to make the model more
robust to missing data by randomly masking parts of the spectrogram in both time and
frequency dimensions, applied with a probability of 0.5. It helps the model learn to handle
occlusions and missing parts in the data.
– Normalization: The ‘ZScoreMinMax‘ layer standardizes the spectrogram data using mean
and standard deviation, followed by rescaling to the [0, 1] range. This ensures that the data
fed into the model is on a consistent scale, improving learning stability.
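The difference between MixUp and CutMix on audio is easiest to see in code: CutMix replaces a contiguous time span instead of blending whole signals. The sketch below applies CutMix with the stated probability of 0.65; the Beta(1, 1) distribution for the span length, the single span per batch, and the function name `audio_cutmix` are assumptions.

```python
import numpy as np

def audio_cutmix(waves, labels, prob=0.65, alpha=1.0, rng=None):
    """Paste a random time span from a partner clip into each clip."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= prob:                    # skip augmentation for this batch
        return waves.copy(), labels.copy()
    n = waves.shape[1]
    lam = rng.beta(alpha, alpha)                # assumed: share of the clip kept
    cut = int(round((1.0 - lam) * n))           # length of the pasted span
    start = rng.integers(0, n - cut + 1)
    perm = rng.permutation(len(waves))          # partner clip for each sample
    mixed = waves.copy()
    mixed[:, start:start + cut] = waves[perm, start:start + cut]
    keep = 1.0 - cut / n                        # label weight follows span length
    return mixed, keep * labels + (1.0 - keep) * labels[perm]

rng = np.random.default_rng(0)
waves = rng.standard_normal((4, 320_000))       # four 10-second clips at 32 kHz
labels = np.eye(4)
cut_waves, cut_labels = audio_cutmix(waves, labels, prob=1.0, rng=rng)
same_waves, same_labels = audio_cutmix(waves, labels, prob=0.0, rng=rng)
print(cut_waves.shape, cut_labels.sum(axis=1))
```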
Effectiveness of Augmentation Techniques: The pre-trained EfficientNet-B1 model benefited
more from time and frequency masking techniques compared to the EfficientNet-B0 model. This is likely
because the EfficientNet-B1 model had already learned general audio features during its pre-training
phase, making it more adaptable to variations introduced by augmentations. The EfficientNet-B0 model,
lacking this pre-trained knowledge, struggled with the same augmentations as it was still learning
fundamental patterns from the current dataset.
5.6. Training Procedure
The training procedure involves a carefully designed pipeline to ensure effective learning and
generalization. The dataset is stratified into five folds for cross-validation; classes with very few
samples are always included in the training set to address class imbalance. Upsampling is employed to
ensure that minority classes are adequately represented in the training data. The training configurations
are provided in Table 5.
Table 5
Training procedure parameters and configurations
Parameter | Configuration
Cross-Validation | 5 folds
Batch Size | 32
Learning Rate | 1e-3 (cosine scheduler)
Optimizer | Adam
Loss Function | Categorical Cross Entropy with label smoothing (0.05)
Early Stopping | Patience: 5 epochs
The model is trained on a TPU device, utilizing the TPU-VM for automatic device selection and
training acceleration. Early stopping is implemented to prevent overfitting, with a patience of 5 epochs.
The model’s performance is evaluated using the padded cmAP (macro-averaged average precision)
score, which accounts for class imbalance and zero true positive labels for certain species. Additionally,
the area under the Precision-Recall (PR) curve is used as the primary AUC metric.
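Two entries of Table 5, the label-smoothed categorical cross-entropy and the cosine learning-rate schedule, can be written out explicitly. The sketch below is a NumPy illustration, not the TensorFlow training code: the smoothing rule (spreading 0.05 uniformly over all classes) follows the usual Keras convention, and the schedule assumes no warm-up and a decay to zero, neither of which is stated in the paper.

```python
import numpy as np

def smoothed_cross_entropy(logits, one_hot, smoothing=0.05):
    """Categorical cross-entropy against label-smoothed targets."""
    n_classes = one_hot.shape[1]
    soft = one_hot * (1.0 - smoothing) + smoothing / n_classes
    shifted = logits - logits.max(axis=1, keepdims=True)        # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(soft * log_probs).sum(axis=1).mean())

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr to min_lr (assumed: no warm-up phase)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + np.cos(np.pi * progress))

n_classes = 182                                  # species count in the 2024 data
one_hot = np.eye(n_classes)[[3, 17]]             # two dummy samples
uniform_loss = smoothed_cross_entropy(np.zeros((2, n_classes)), one_hot)
schedule = [cosine_lr(s, 100) for s in (0, 50, 100)]
print(uniform_loss, schedule)
```

For uniform predictions the loss equals log(182) regardless of the smoothing value, since the smoothed targets still sum to 1.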
6. Post Processing
Post-processing is integral to refining model predictions and enhancing overall classification
performance. The ensemble method combines the predictions of two distinct models to leverage their
individual strengths, thereby enhancing robustness and accuracy.
6.1. Ensemble Strategy
To achieve optimal classification results, I implemented an ensemble method where the final
prediction for each audio clip is computed as a weighted average of predictions from EfficientNet-B0 and
EfficientNet-B1. The ensemble weights were empirically determined based on cross-validation
performance: 0.6 for EfficientNet-B0 and 0.4 for EfficientNet-B1. This weighting scheme balances the unique
capabilities of each model effectively.
$$\text{Final Prediction} = 0.6 \times \text{Predictions}_{\text{EfficientNet-B0}} + 0.4 \times \text{Predictions}_{\text{EfficientNet-B1}}$$
This weighted average helps in smoothing out the variances and combining the high-confidence
predictions from each model. This approach effectively leverages the complementary strengths of
EfficientNet-B0 and EfficientNet-B1, providing a comprehensive solution for bird classification in the
BirdCLEF24 challenge.
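The weighted average is a one-line operation once both models' predictions are arranged as arrays of shape (segments, species). The following sketch uses made-up prediction values for two segments and two species; only the weights 0.6 and 0.4 come from the text.

```python
import numpy as np

def ensemble(pred_b0, pred_b1, w_b0=0.6, w_b1=0.4):
    """Weighted average of per-segment, per-species probabilities."""
    return w_b0 * pred_b0 + w_b1 * pred_b1

# Dummy predictions for two 5-second segments and two species.
pred_b0 = np.array([[0.9, 0.1], [0.2, 0.7]])
pred_b1 = np.array([[0.5, 0.3], [0.4, 0.5]])
final = ensemble(pred_b0, pred_b1)
print(final)
```

Because ROC-AUC depends only on the ranking of scores within each species, the weights matter through how they reorder segments rather than through the absolute values of the averaged probabilities.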
7. Results
Table 6 summarizes the performance of the models and strategies evaluated in this
study, measured by their private and public scores on the BirdCLEF 2024 competition leaderboard. The
scores represent the macro-averaged ROC-AUC [18], accounting for class imbalance and providing a
robust measure of model performance.
Table 6
Model Descriptions and Performance Scores
Model Description | Private Score | Public Score
EfficientNet-B0 | 0.649998 | 0.654307
EfficientNet-B1 with Pretraining | 0.596740 | 0.632268
Models Ensemble | 0.652743 | 0.663388
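For readers unfamiliar with the metric, macro-averaged ROC-AUC can be computed per species and then averaged. The sketch below is a small NumPy illustration, not the official scoring code [18]: it assumes that species without both positive and negative labels are skipped, and it uses a pairwise comparison that is only practical for small examples.

```python
import numpy as np

def binary_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def macro_auc(y_true, y_score):
    """Average the per-class AUC over classes where it is defined."""
    aucs = []
    for c in range(y_true.shape[1]):
        col = y_true[:, c]
        if col.min() == col.max():
            continue                            # assumed: skip undefined classes
        aucs.append(binary_auc(col, y_score[:, c]))
    return float(np.mean(aucs))

y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
y_score = np.array([[0.9, 0.6, 0.1],
                    [0.1, 0.4, 0.2],
                    [0.8, 0.3, 0.3],
                    [0.2, 0.7, 0.1]])
score = macro_auc(y_true, y_score)
print(score)  # 0.875
```

In this example the first species is ranked perfectly (AUC 1.0), the second has three of four pairs ordered correctly (AUC 0.75), and the third is skipped because it has no positive labels.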
7.1. Analysis
• EfficientNet-B0: The EfficientNet-B0 model [5] demonstrated consistent performance, achieving
a private score of 0.649998 and a public score of 0.654307. This balanced performance across both
datasets indicates its robustness and generalization capabilities in handling the BirdCLEF dataset.
• EfficientNet-B1 with Pretraining: The EfficientNet-B1 model [4], which was enhanced with
pretraining, showed a clear drop in the private score (0.596740) compared to its public score
(0.632268), and scored below EfficientNet-B0 on both. This gap suggests some overfitting to the
public test subset or less effective generalization to the private subset.
• Models Ensemble: The ensemble of models [19] achieved the highest scores, with a private
score of 0.652743 and a public score of 0.663388. This approach effectively leveraged the strengths
of multiple models, leading to better overall performance and indicating the effectiveness of
model ensembling in complex tasks such as bird species identification from audio recordings.
Overall, public and private scores were strongly correlated, with a correlation of 0.96 between them
[20]. Nevertheless, rankings shifted considerably between the public and private leaderboards.
8. Conclusion
The ensemble of EfficientNet-B0 and EfficientNet-B1 models proved to be an effective strategy for
the BirdCLEF 2024 challenge, outperforming individual models and enhancing overall classification
performance. EfficientNet-B0, trained with heavy data augmentation, and EfficientNet-B1, pre-trained
on historical BirdCLEF datasets, complemented each other well. The ensemble approach leveraged
the strengths of both models, achieving a balance between computational efficiency and classification
accuracy. My findings underscore the importance of combining diverse data sources and robust
augmentation techniques in building resilient audio classification systems. Future work could explore
further optimization of ensemble weights and the incorporation of additional data augmentation
methods to continue improving performance in audio classification tasks.
9. Acknowledgments
I would like to thank Stefan Kahl, Willem-Pier Vellinga, Tom Denton, Holger Klinck, and Hervé Glotin for
their exceptional leadership and expertise throughout the BirdCLEF24 competition. I am also immensely
grateful to the collaborating institutions—Kaggle, Chemnitz University of Technology, Google Research,
the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology, the Indian
Institute of Science Education and Research (IISER) Tirupati, LifeCLEF, and Xeno-canto—for providing
invaluable resources, data, and support. Their collective contributions have been crucial to the success
of this project.
References
[1] S. Kahl, T. Denton, H. Klinck, V. Ramesh, V. Joshi, M. Srivathsa, A. Anand, C. Arvind, H. CP, S. Sawant, V. V. Robin, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the western ghats, Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum (2024).
[2] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, et al., Overview of lifeclef 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[3] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946 (2020). URL: https://arxiv.org/abs/1905.11946.
[4] A. Aikhmelnytskyy, Birdclef24 pretraining is all you need - infer, 2024. URL: https://www.kaggle.com/code/aikhmelnytskyy/birdclef24-pretraining-is-all-you-need-infer.
[5] TC0000, Birdclef starter notebook, 2024. URL: https://www.kaggle.com/code/tc0000/birdclef-starter-notebook.
[6] K. O’Shea, R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458 (2015). URL: https://arxiv.org/abs/1511.08458.
[7] Nicholson, Birdclef 2024: Spectrograms imagenet run, 2024. URL: https://www.kaggle.com/code/richolson/birdclef-2024-spectrograms-imagenet-run#Initialize-submit-DF-with-correct-columns.
[8] G. Zhou, Y. Zhang, J. Pan, Z. Han, Short-time fourier transform with the window size fixed in the frequency domain (2017). URL: https://www.researchgate.net/publication/321043608_Short-Time_Fourier_Transform_with_the_Window_Size_Fixed_in_the_Frequency_Domain.
[9] C. Muljana, T.-P. M. Luo, A review of online course dropout research: Implications for practice and future research, ResearchGate (2019). URL: https://www.researchgate.net/publication/227246914_A_review_of_online_course_dropout_research_Implications_for_practice_and_future_research.
[10] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, arXiv preprint arXiv:1411.1792 (2014). URL: https://arxiv.org/pdf/1411.1792.
[11] Awsaf49, Birdclef23 pretraining is all you need - train, 2023. URL: https://www.kaggle.com/code/awsaf49/birdclef23-pretraining-is-all-you-need-train.
[12] Kaggle, Birdclef 2021 competition, 2021. URL: https://www.kaggle.com/competitions/birdclef-2021.
[13] Kaggle, Birdclef 2022 competition, 2022. URL: https://www.kaggle.com/competitions/birdclef-2022.
[14] Kaggle, Birdclef 2023 competition, 2023. URL: https://www.kaggle.com/competitions/birdclef-2023.
[15] Kaggle, Birdclef 2024 competition, 2024. URL: https://www.kaggle.com/competitions/birdclef-2024.
[16] R. Rao, Xeno-canto bird recordings extended a-m, 2024. URL: https://www.kaggle.com/datasets/rohanrao/xeno-canto-bird-recordings-extended-a-m.
[17] R. Rao, Xeno-canto bird recordings extended n-z, 2024. URL: https://www.kaggle.com/datasets/rohanrao/xeno-canto-bird-recordings-extended-n-z.
[18] Metric, Birdclef roc auc, 2024. URL: https://www.kaggle.com/code/metric/birdclef-roc-auc.
[19] A. Porwal, Silver medal solution - 25th place, 2024. URL: https://www.kaggle.com/code/aadityaporwal/silver-medal-solution-25th-place.
[20] Correlation between public and private scores, 2024. URL: https://www.kaggle.com/competitions/birdclef-2024/discussion/512197.