Bird-Species Audio Identification, Ensembling of EfficientNet-B0 and Pre-trained EfficientNet-B1 model
Notebook for the Lab at CLEF 2024

Aaditya Porwal1
1 Indian Institute of Technology, Dhanbad, India

Abstract
In this study, I present a novel approach to audio classification for the BirdCLEF 2024 challenge, based on an ensemble of EfficientNet models. The methodology combines EfficientNet-B0, trained exclusively on the current competition's data, with EfficientNet-B1, pre-trained on datasets from previous BirdCLEF competitions. The EfficientNet-B0 model relies on heavy augmentation to improve generalization and robustness. Data preprocessing transforms audio signals into Mel spectrograms, optimized through feature engineering and augmentation methods. The ensemble, which combines predictions from both models, outperforms either model on its own and reached 25th place out of 975 competitors on the BirdCLEF 2024 leaderboard.

Keywords
Deep Learning, Bird Species Classification, Transfer Learning, Attention Mechanism, Sound Detection, Audio Source Detection, EfficientNet, Ensembling, Audio Classification, BirdCLEF, Ensemble Learning, Data Augmentation, Mel Spectrogram, Convolutional Neural Network, Feature Engineering, ROC-AUC

1. Introduction

There are about 10,000 bird species in the world, and they all play an important role in the natural world. Birds are excellent indicators of biodiversity change because they are highly mobile and have diverse habitat requirements. BirdCLEF 2024 [1] is a Kaggle competition organized by The Cornell Lab of Ornithology in collaboration with LifeCLEF 2024 [2]. Its challenge is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts.
The BirdCLEF 2024 competition focuses on identifying bird calls in long recordings, particularly from the sky-islands of the Western Ghats. The competition presents significant challenges, including imbalanced training data per species, domain shifts between training and test data, and a strict two-hour time limit for analyzing extensive recordings.

This paper first describes the competition and its data, to make clear the challenges posed by the train and test sets. It then details the solution: data preparation, augmentations, model building, training procedures, post-processing, and conclusions. If successful, this effort could advance ongoing initiatives to protect avian biodiversity in the Western Ghats, India.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
Email: aadityaporwal234@gmail.com (A. Porwal)
GitHub: https://github.com/AADI-234 (A. Porwal)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Data

2.1. Training Data

• Train audio: The bulk of the training data consists of short recordings of individual bird calls from xeno-canto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format. The data covers 182 unique species.
• Train metadata: Along with the audio files, metadata is provided, consisting of primary label, secondary labels, type, latitude, longitude, scientific name, common name, author, filename, license, rating, and URL.
• Unlabeled soundscapes: Unlabeled audio from the same locations as the test soundscapes is provided.
• eBird_Taxonomy_v2021.csv: Contains data on species relationships.

2.2. Test Data

Test_soundscapes: The test_soundscapes directory will be populated with approximately 1,100 recordings to be used for scoring. They are 4 minutes long and in ogg audio format.

3. My Approach

My final approach was to ensemble two different EfficientNets (B0 and B1) [3]. The B1 model [4] was pre-trained on data from previous years' BirdCLEF competitions, while the B0 model [5] was trained using only this competition's data.

4. EfficientNet-B0 using heavy augmentation

4.1. Overview

The task of audio classification involves identifying and categorizing audio signals into predefined classes. In this project, I employed EfficientNet-B0, a highly efficient convolutional neural network [6], to classify audio signals. EfficientNet-B0 was chosen for its balance of performance and computational efficiency. The audio data is preprocessed and transformed into Mel spectrograms, which are fed into the EfficientNet-B0 model. The model is trained with a variety of data augmentation techniques to improve generalization and robustness. Classification performance is optimized through feature engineering, cross-validation, and careful selection of hyperparameters.

4.2. Data Preparation

The audio data used in this project is sampled at 32,000 samples per second (sr = 32000). To handle the varying length of the audio files, a fixed duration of 30 seconds is set for all training clips. From these 30-second clips, random 5-second segments are extracted for Short-Time Fourier Transform (STFT) processing [7]. For testing, a uniform duration of 5 seconds is used. More details are provided in Table 1.

Table 1
Audio data parameters and spectrogram configuration

Parameter               Value
Sample Rate             32000 Hz
Clip Duration           30 sec
STFT Segment            5 sec
Frequency Range         20 Hz - 15,000 Hz
Mel Bands               128
FFT Components          1024
Spectrogram Time Axis   512

4.3. Feature Engineering

• Feature engineering focuses on transforming raw audio data (the acoustic signal) into a format suitable for model training. Mel spectrograms are generated from the audio signals, converting them into a visual representation of frequency content over time. This transformation uses the Short-Time Fourier Transform (STFT) [8]: each 5-second audio slice undergoes Fourier transformation to capture its spectral properties. An illustration of this process can be seen in Figure 1.
• Additionally, various augmentation techniques are employed to enhance the training data. Spectrogram-specific augmentations such as masking and coarse dropout [9] further diversify the training data. Mixup augmentation [10], in both waveform and spectrogram forms, combines multiple examples to create synthetic training samples, improving the model's robustness and generalization.

Figure 1: The process of extracting the Mel spectrogram from an acoustic signal. Source: ResearchGate

Figure 2: Logarithmic magnitude spectrogram showing frequency content over time using STFT. Source: Kaggle

4.4. Model Building

The model architecture is based on EfficientNet-B0. It incorporates average pooling (pool_type = ’avg’) and uses Binary Cross-Entropy with Focal Loss (loss_type = "BCEFocalLoss") to address class imbalance. More details are provided in Table 2.

Table 2
EfficientNet-B0 model configurations and training parameters

Component         Configuration
Model             EfficientNet-B0
Pooling           Average Pooling
Loss Function     Binary Cross-Entropy with Focal Loss
Optimizer         Adan
Learning Rate     1.0e-03
Weight Decay      1.0e-02
Early Stopping    Patience = 5 epochs
Mixed Precision   Automatic Mixed Precision (AMP) enabled

4.5. Augmentation

Data augmentation plays a crucial role in improving the model's generalization. Various augmentation techniques are employed to introduce variability in the training data.
These include:

• Waveform Augmentation:
  – Random Noise Addition: This technique involves adding random noise to the audio signal to simulate real-world audio conditions and improve the model's robustness to background noise.
  – Gain Adjustments: Adjusting the volume levels to handle recordings with different gain levels. This helps the model generalize better across recordings with varying loudness.
  – Pitch Shifting: This technique was considered but ultimately set to 0 because it disrupted the distinct frequency patterns critical for bird call classification.
  – Time Shifting: Similarly, time shifting was considered but set to 0 as it disrupted the temporal patterns essential for accurate classification.
• Spectrogram Augmentation:
  – Masking Parts of the Spectrogram: Techniques like spectrogram masking were employed to hide parts of the spectrogram, making the model more robust to missing data. However, this did not show significant improvements, due to the distinct and critical nature of the bird call frequency patterns.
  – Randomly Dropping Coarse Regions: This technique, known as coarse dropout, was used to randomly drop regions of the spectrogram. It aimed to make the model more robust, but preliminary experiments showed limited effectiveness.
  – Horizontal Flipping: Not used because flipping the time axis of a spectrogram is not meaningful for audio data and would disrupt the temporal sequence of the sound.
• Mixup Augmentation: Both waveform and spectrogram mixup techniques were employed. This involves linearly combining two examples to create a new synthetic example, enhancing the model's robustness and generalization capabilities. This was controlled by parameters such as:
  – Waveform Mixup (aug_wave_mixup): Set to 1.0. This parameter indicates that waveform mixup was applied to all audio samples during training. Waveform mixup involves combining the waveforms of two different audio samples by taking a weighted average of their amplitudes.
  – Spectrogram Mixup (aug_spec_mixup): Set to 0.0. This parameter indicates that spectrogram mixup was not applied to the spectrograms during training. Spectrogram mixup would combine the spectrogram representations of two different audio samples in the same way that waveform mixup combines waveforms.
  – Probability of applying Spectrogram Mixup (aug_spec_mixup_prob): Set to 0.5. This parameter defines the probability with which spectrogram mixup would be applied to an audio sample during training. Because spectrogram mixup was set to 0.0 here, this parameter had no effect; it would control the likelihood of application if spectrogram mixup were enabled.

The mix ratio for combining the samples was drawn from a Beta distribution with 𝛼 = 0.95. The Beta distribution is commonly used in mixup to control the interpolation between two samples. Assuming the usual symmetric form Beta(𝛼, 𝛼), a value of 𝛼 only slightly below 1 gives a distribution that is close to uniform on [0, 1], with only a mild preference for ratios near 0 or 1. The synthetic samples therefore range from ones dominated by a single original recording to nearly balanced blends of both.

4.6. Training Procedure

The training procedure involves cross-validation, early stopping, and augmentation strategies. The training configurations are provided in Table 3.

Table 3
Training procedure parameters and configurations

Parameter          Configuration
Cross-Validation   5 folds
Training Epochs    Max 9 epochs, augmentation for first 6 epochs
Batch Size         Training: 32, Validation: 1
Oversampling       Threshold: 60
Mixup Function     Enabled
Spectral Mixup     Enabled

The mixup function combines two examples in the dataset to create a new example, improving generalization. Spectral mixup further diversifies training data by combining spectrograms from different audio samples. These strategies help the model learn robust and generalized representations, improving classification performance.

5. EfficientNet-B1 with Pre-Training

5.1. Overview

In this part of the project, I developed a model for identifying bird calls using TensorFlow and the EfficientNet-B1 architecture, drawing inspiration from previous work on pre-training by Awsaf (Md Awsafur Rahman) [11]. The model was pre-trained on BirdCLEF datasets from 2021-2023 and Xeno-Canto Extend, and fine-tuned on BirdCLEF 2024 data to benefit from transfer learning. Audio processing and feature extraction were optimized for TPU devices, addressing challenges such as spectrogram augmentation and effective transfer learning.

5.2. Data Preparation

Data Sources: BirdCLEF datasets from 2021-2024 and Xeno-Canto Extend [12] [13] [14] [15] [16] [17].

The raw audio data, stored in .ogg format, is loaded efficiently using the tf.data API. Each audio clip is sampled at 32,000 Hz and has a uniform duration of 10 seconds. The frequency range spans from 20 Hz to 16,000 Hz. The Short-Time Fourier Transform (STFT) parameters include an FFT window size of 2028 and a spectrogram window size of 2048. These settings are chosen so that the critical audio characteristics are well represented for model training. More details are provided in Table 4.

Table 4
Audio data parameters and spectrogram configurations for EfficientNet-B1

Parameter                 Value
Sample Rate               32000 Hz
Clip Duration             10 sec
Frequency Range           20 Hz - 16,000 Hz
FFT Window Size           2028
Spectrogram Window Size   2048

5.3. Feature Engineering

Feature engineering focuses on transforming raw audio data into a format suitable for model training. Mel spectrograms are generated from the audio signals, converting them into a visual representation of frequency content over time.
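As a rough illustration of the Mel spectrogram step, the sketch below computes a log-Mel spectrogram with NumPy only. It is not the TensorFlow layer used in the notebook. The function names are hypothetical, a single window length of 2048 is assumed for both framing and FFT, and the hop length (512) and number of Mel bands (128) are illustrative values that the paper does not state for this model.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Hz -> Mel conversion.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    """Triangular Mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 points, equally spaced on the Mel scale, define the triangles.
    hz_points = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        low, center, high = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(wave, sr=32000, n_fft=2048, hop=512,
                        n_mels=128, fmin=20.0, fmax=16000.0):
    """Return a log-power Mel spectrogram of shape (n_mels, n_frames)."""
    wave = np.asarray(wave, dtype=np.float64)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels, fmin, fmax).T
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T   # decibel scale

# Example: a 10-second clip of noise sampled at 32 kHz.
clip = np.random.default_rng(0).standard_normal(10 * 32000)
print(log_mel_spectrogram(clip).shape)  # (128, 622)
```

The frame count follows from 1 + (320000 - 2048) // 512 = 622 for a 10-second clip, so each clip becomes a 128 x 622 image-like array under these assumed settings.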
This transformation is performed by the MelSpectrogram layer, in which each 10-second audio slice undergoes Fourier transformation to capture its spectral properties. To improve the model's robustness, I apply the following operations to the spectrograms:

• Time and Frequency Masking: Randomly masks parts of the spectrogram in both time and frequency dimensions.
• Normalization: Standardizes the data using mean and standard deviation, followed by rescaling to the [0, 1] range.

5.4. Model Building

The model architecture is based on EfficientNet-B1, a convolutional neural network known for its efficiency and performance. Key configurations include:

• Pretraining: Initialized with ImageNet weights to leverage transfer learning.
• Final Activation Function: Softmax, for multi-class classification.
• Filter Stride Reduction (FSR): Reduces the stride in the stem block.

The model incorporates several custom TensorFlow layers to handle specific tasks. Equivalent layers could also be implemented in PyTorch.

• MelSpectrogram Layer: Converts audio signals into Mel spectrograms.
• TimeFreqMask Layer: Applies time and frequency masking for spectrogram augmentation.
• ZScoreMinMax Layer: Standardizes and rescales the spectrogram data.
• MixUp and CutMix Layers: Augment the training data by mixing audio samples.

5.5. Augmentation

Data augmentation plays a crucial role in improving the model's generalization. Various augmentation techniques are employed to introduce variability in the training data:

• Audio Augmentation:
  – Gaussian Noise Addition: This technique was chosen to simulate different environmental noise conditions and make the model robust to noise. Applied with a probability of 0.5, it adds random noise to the audio signal, improving the model's ability to generalize to noisy data.
  – Time Shifting: Shifts the audio signal in time to introduce variability. It was set to 0 because it disrupted the temporal patterns essential for accurate classification.
  – MixUp: This technique involves mixing two audio signals to create a synthetic example, applied with a probability of 0.65. This helps the model generalize better by providing varied training examples.
  – CutMix: Similar to MixUp, CutMix combines two audio signals, but by cutting and pasting parts of each. It is also applied with a probability of 0.65.
• Spectrogram Augmentation:
  – Time and Frequency Masking: This technique was chosen to make the model more robust to missing data by randomly masking parts of the spectrogram in both time and frequency dimensions, applied with a probability of 0.5. It helps the model learn to handle occlusions and missing parts in the data.
  – Normalization: The ZScoreMinMax layer standardizes the spectrogram data using mean and standard deviation, followed by rescaling to the [0, 1] range. This ensures that the data fed into the model is on a consistent scale, improving learning stability.

Effectiveness of Augmentation Techniques: The pre-trained EfficientNet-B1 model benefited more from time and frequency masking than the EfficientNet-B0 model. This is likely because the EfficientNet-B1 model had already learned general audio features during its pre-training phase, making it more adaptable to variations introduced by augmentations. The EfficientNet-B0 model, lacking this pre-trained knowledge, struggled with the same augmentations as it was still learning fundamental patterns from the current dataset.

5.6. Training Procedure

The training procedure involves a carefully designed pipeline to ensure effective learning and generalization. The dataset is stratified into five folds for cross-validation, with classes with very few samples always included in the training set to address class imbalance.
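The fold assignment described above could be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name assign_folds, the min_samples threshold for "very few samples", and the round-robin stratification are all assumptions, since the paper does not specify them.

```python
from collections import defaultdict

import numpy as np

def assign_folds(labels, n_folds=5, min_samples=None, seed=0):
    """Assign a fold index to every sample, stratified by class.

    Samples of classes with fewer than `min_samples` examples get fold -1,
    meaning they are never held out and always stay in the training set.
    The threshold is hypothetical; by default it equals `n_folds`.
    """
    if min_samples is None:
        min_samples = n_folds
    rng = np.random.default_rng(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = np.full(len(labels), -1, dtype=int)
    for idxs in by_class.values():
        if len(idxs) < min_samples:
            continue  # rare class: keep every sample in training
        shuffled = rng.permutation(idxs)
        # Round-robin assignment spreads each class evenly over the folds.
        folds[shuffled] = np.arange(len(shuffled)) % n_folds
    return folds

# Example: class "c" has only two recordings, so it is never held out.
labels = ["a"] * 10 + ["b"] * 7 + ["c"] * 2
folds = assign_folds(labels)
val_mask = folds == 0      # validation samples for fold 0
train_mask = folds != 0    # includes all rare-class samples (fold -1)
print(int(val_mask.sum()), int(train_mask.sum()))
```

For fold k, validation uses the samples with folds == k and training uses everything else, so rare-class samples marked -1 appear in the training set of every fold.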
Upsampling is employed to ensure that minority classes are adequately represented in the training data. The training configurations are provided in Table 5.

Table 5
Training procedure parameters and configurations

Parameter          Configuration
Cross-Validation   5 folds
Batch Size         32
Learning Rate      1e-3 (cosine scheduler)
Optimizer          Adam
Loss Function      Categorical Cross Entropy with label smoothing (0.05)
Early Stopping     Patience: 5 epochs

The model is trained on a TPU device, utilizing the TPU-VM for automatic device selection and training acceleration. Early stopping is implemented to prevent overfitting, with a patience of 5 epochs. The model's performance is evaluated using the padded cmAP (macro-averaged average precision) score, which accounts for class imbalance and zero true positive labels for certain species. Additionally, the area under the Precision-Recall (PR) curve is used as the primary AUC metric.

6. Post Processing

Post-processing is integral to refining model predictions and improving overall classification performance. The ensemble method combines the predictions of two distinct models to leverage their individual strengths, improving robustness and accuracy.

6.1. Ensemble Strategy

The final prediction for each audio clip is computed as a weighted average of the predictions from EfficientNet-B0 and EfficientNet-B1. The ensemble weights were determined empirically from cross-validation performance: 0.6 for EfficientNet-B0 and 0.4 for EfficientNet-B1.

\[
\text{Final Prediction} = 0.6 \times \text{Predictions}_{\text{EfficientNet-B0}} + 0.4 \times \text{Predictions}_{\text{EfficientNet-B1}}
\]

This weighted average helps smooth out the variance of the individual models and combines the high-confidence predictions from each.
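The weighted average above can be sketched in a few lines of NumPy. The function name ensemble_predictions and the example scores are hypothetical; only the 0.6 and 0.4 weights come from the paper, and the arrays are assumed to hold one row per audio segment and one column per species.

```python
import numpy as np

def ensemble_predictions(pred_b0, pred_b1, w_b0=0.6, w_b1=0.4):
    """Weighted average of two prediction arrays of identical shape."""
    pred_b0 = np.asarray(pred_b0, dtype=np.float64)
    pred_b1 = np.asarray(pred_b1, dtype=np.float64)
    if pred_b0.shape != pred_b1.shape:
        raise ValueError("Both models must predict the same segments and classes.")
    return w_b0 * pred_b0 + w_b1 * pred_b1

# Made-up scores for 2 segments and 3 species.
b0_scores = [[0.9, 0.1, 0.2],
             [0.3, 0.7, 0.1]]
b1_scores = [[0.8, 0.1, 0.1],
             [0.1, 0.6, 0.3]]
print(ensemble_predictions(b0_scores, b1_scores))
```

Since EfficientNet-B0 is trained with a binary cross-entropy style loss and EfficientNet-B1 with a softmax output, the two score scales may differ; the paper does not state whether any rescaling was applied before averaging, so none is assumed here.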
This approach leverages the complementary strengths of EfficientNet-B0 and EfficientNet-B1, providing a comprehensive solution for bird classification in the BirdCLEF24 challenge.

7. Results

Table 6 summarizes the performance of the models and strategies evaluated in this study, measured by their private and public scores on the BirdCLEF 2024 competition leaderboard. The scores are the macro-averaged ROC-AUC [18], which accounts for class imbalance and provides a robust measure of model performance.

Table 6
Model Descriptions and Performance Scores

Model Descriptions                 Private Score   Public Score
EfficientNet-B0                    0.649998        0.654307
EfficientNet-B1 with Pretraining   0.596740        0.632268
Models Ensemble                    0.652743        0.663388

7.1. Analysis

• EfficientNet-B0: The EfficientNet-B0 model [5] performed consistently, with a private score of 0.649998 and a public score of 0.654307. The small gap between the two scores indicates that the model generalizes well on the BirdCLEF dataset.
• EfficientNet-B1 with Pretraining: The EfficientNet-B1 model [4], which was enhanced with pretraining, dropped from a public score of 0.632268 to a private score of 0.596740, a difference of about 0.036. This suggests that while pretraining helped on the public dataset, it may have led to some overfitting or less effective generalization on the private dataset.
• Models Ensemble: The ensemble of models [19] achieved the highest scores, with a private score of 0.652743 and a public score of 0.663388. Combining the two models improved on each of them individually, which indicates the effectiveness of model ensembling in complex tasks such as bird species identification from audio recordings.

Overall, the results were very stable, with a correlation of 0.96 between public and private scores [20].
However, there were significant changes visible between the public and private leaderboards.

8. Conclusion

The ensemble of EfficientNet-B0 and EfficientNet-B1 models proved to be an effective strategy for the BirdCLEF 2024 challenge, outperforming the individual models. EfficientNet-B0, trained with heavy data augmentation, and EfficientNet-B1, pre-trained on historical BirdCLEF datasets, complemented each other well. The ensemble leveraged the strengths of both models, achieving a balance between computational efficiency and classification accuracy. My findings underscore the importance of combining diverse data sources and robust augmentation techniques in building resilient audio classification systems. Future work could explore further optimization of the ensemble weights and additional data augmentation methods to continue improving performance in audio classification tasks.

9. Acknowledgments

I would like to thank Stefan Kahl, Willem-Pier Vellinga, Tom Denton, Holger Klinck, and Hervé Glotin for their exceptional leadership and expertise throughout the BirdCLEF24 competition. I am also immensely grateful to the collaborating institutions (Kaggle, Chemnitz University of Technology, Google Research, the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology, the Indian Institute of Science Education and Research (IISER) Tirupati, LifeCLEF, and Xeno-canto) for providing invaluable resources, data, and support. Their collective contributions have been crucial to the success of this project.

References

[1] S. Kahl, T. Denton, H. Klinck, V. Ramesh, V. Joshi, M. Srivathsa, A. Anand, C. Arvind, H. CP, S. Sawant, V. V. Robin, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the western ghats, Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum (2024).
[2] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, et al., Overview of lifeclef 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[3] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946 (2020). URL: https://arxiv.org/abs/1905.11946.
[4] A. Aikhmelnytskyy, Birdclef24 pretraining is all you need - infer, 2024. URL: https://www.kaggle.com/code/aikhmelnytskyy/birdclef24-pretraining-is-all-you-need-infer.
[5] TC0000, Birdclef starter notebook, 2024. URL: https://www.kaggle.com/code/tc0000/birdclef-starter-notebook.
[6] K. O'Shea, R. Nash, An introduction to convolutional neural networks, arXiv preprint arXiv:1511.08458 (2015). URL: https://arxiv.org/abs/1511.08458.
[7] Nicholson, Birdclef 2024: Spectrograms imagenet run, 2024. URL: https://www.kaggle.com/code/richolson/birdclef-2024-spectrograms-imagenet-run#Initialize-submit-DF-with-correct-columns.
[8] G. Zhou, Y. Zhang, J. Pan, Z. Han, Short-time fourier transform with the window size fixed in the frequency domain (2017). URL: https://www.researchgate.net/publication/321043608_Short-Time_Fourier_Transform_with_the_Window_Size_Fixed_in_the_Frequency_Domain.
[9] C. Muljana, T.-P. M. Luo, A review of online course dropout research: Implications for practice and future research, ResearchGate (2019). URL: https://www.researchgate.net/publication/227246914_A_review_of_online_course_dropout_research_Implications_for_practice_and_future_research.
[10] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, arXiv preprint arXiv:1411.1792 (2014). URL: https://arxiv.org/pdf/1411.1792.
[11] Awsaf49, Birdclef23 pretraining is all you need - train, 2023. URL: https://www.kaggle.com/code/awsaf49/birdclef23-pretraining-is-all-you-need-train.
[12] Kaggle, Birdclef 2021 competition, 2021. URL: https://www.kaggle.com/competitions/birdclef-2021.
[13] Kaggle, Birdclef 2022 competition, 2022. URL: https://www.kaggle.com/competitions/birdclef-2022.
[14] Kaggle, Birdclef 2023 competition, 2023. URL: https://www.kaggle.com/competitions/birdclef-2023.
[15] Kaggle, Birdclef 2024 competition, 2024. URL: https://www.kaggle.com/competitions/birdclef-2024.
[16] R. Rao, Xeno-canto bird recordings extended a-m, 2024. URL: https://www.kaggle.com/datasets/rohanrao/xeno-canto-bird-recordings-extended-a-m.
[17] R. Rao, Xeno-canto bird recordings extended n-z, 2024. URL: https://www.kaggle.com/datasets/rohanrao/xeno-canto-bird-recordings-extended-n-z.
[18] Metric, Birdclef roc auc, 2024. URL: https://www.kaggle.com/code/metric/birdclef-roc-auc.
[19] A. Porwal, Silver medal solution - 25th place, 2024. URL: https://www.kaggle.com/code/aadityaporwal/silver-medal-solution-25th-place.
[20] Correlation between public and private scores, 2024. URL: https://www.kaggle.com/competitions/birdclef-2024/discussion/512197.