TUC Media Computing at BirdCLEF 2024: Improving Birdsong Classification Through Single Learning Models

Notebook for the Media Computing Lab at CLEF 2024

Arunodhayan Sampath Kumar*, Tobias Schlosser and Danny Kowerko
Junior Professorship of Media Computing, Chemnitz University of Technology, 09107 Chemnitz, Germany

Abstract
This work presents our contribution to the BirdCLEF 2024 competition, aimed at enhancing birdsong classification through the implementation of single learning models. Our primary objective was to develop a robust model capable of processing continuous audio data to accurately identify bird species from their calls. Key challenges included addressing imbalanced training data, managing domain shifts between training and test samples, and processing extensive soundscape recordings within a limited time frame. Our proposed approach leverages transformer-based architectures, incorporating positional encodings to enhance the spatial context of the input spectrograms. Data augmentation techniques were employed to mitigate the effects of noisy labels and domain shifts. The model training process involved various libraries and frameworks, with a focus on optimizing performance through strategies such as cosine annealing and weighted sampling. Our results indicate the potential of this approach to improve the accuracy and efficiency of birdsong classification in support of avian population monitoring and conservation efforts. Among a total of 974 teams, whose results ranged between 45.8 % and 69.0 % ROC-AUC on the private leaderboard, our best model achieved a ROC-AUC score of 62.9 %, which corresponds to 300th place among all submissions. On the public leaderboard, our model achieved a score of 63.9 %.

Keywords
Audio Classification, Birdsong Soundscapes, Computer Vision and Pattern Recognition, Convolutional Neural Networks (CNN), Data-Efficient Image Transformers (DeiT), Vision Transformers (ViT)

1. Introduction and motivation

The BirdCLEF 2024 [1, 2] competition aimed to identify bird species in a global biodiversity hotspot, the Western Ghats, also known as the Sahyadri. The broader goals of BirdCLEF 2024 included developing a deep learning model capable of processing continuous data to recognize bird species by their calls. Specifically, the objectives were to identify endemic bird species in the soundscape data of the sky islands, to detect and classify endangered bird species (species of conservation concern) despite limited training data, and to detect and classify nocturnal bird species, which are currently poorly understood.

This year's primary challenges involved a significant imbalance in the training set, a domain shift between the clean training samples and the soundscape test samples, and the time constraint of testing 73.3 hours of diverse soundscape recordings in only 2 hours. Nevertheless, advancements in deep learning models are expected to aid in monitoring avian populations, facilitate more effective threat evaluation, and allow for timely adjustments to conservation actions. Ultimately, this will benefit avian populations and support long-term sustainability efforts.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: arunodhayan.sampath-kumar@cs.tu-chemnitz.de (A. S. Kumar); tobias.schlosser@cs.tu-chemnitz.de (T. Schlosser); danny.kowerko@cs.tu-chemnitz.de (D. Kowerko)
URL: https://arunodhayan.github.io (A. S. Kumar); https://www.tobias-schlosser.net (T. Schlosser); https://tu-chemnitz.de/cs/mc (D. Kowerko)
ORCID: 0000-0002-2654-7050 (A. S. Kumar); 0000-0002-0682-4284 (T. Schlosser); 0000-0002-2411-8877 (D. Kowerko)

Table 1
Datasets used for model pre-training (IDs 2, 3, and 4) and data augmentation (IDs 5 and 6).

ID  Name               Classes  Files
1   BirdCLEF 2024 [5]  182      24 459
2   BirdCLEF 2021 [6]  397      62 874
3   BirdCLEF 2022 [7]  152      14 852
4   BirdCLEF 2023 [8]  264      16 940
5   DCASE [9]          1        22 012
6   ESC-50 [10]        4        500
    Total              1 000    141 637

2. Fundamentals and implementation

The implementation of transformer-based bird species recognition presented in this work comprises data preparation, feature extraction, model architecture, data augmentation, and training methods. This work builds on our previous implementations for BirdCLEF 2021 and 2022, which utilized convolutional neural networks (CNN) [3, 4].

2.1. Data Preparation

The BirdCLEF 2024 training set consisted of 24 459 audio recordings provided by xeno-canto [5], covering 182 bird species. The test dataset consisted of 1 100 recordings, each 4 minutes in length. Table 1 provides an overview of the individual datasets utilized. Datasets with IDs 2, 3, and 4 were used for model pre-training, while datasets with IDs 5 and 6 were employed as a data augmentation technique for background noise. In total, 141 637 recordings across 1 000 classes were collected for training and data augmentation.

The BirdCLEF 2024 training set consisted of audio files sampled at 32 kHz in a single channel (mono) and compressed using lossy Ogg compression. The datasets used for pre-training and augmentation were likewise converted to 32 kHz and a single channel. The training dataset was weakly labeled, meaning there was no precise information on the presence or absence of the labeled birdsong within the recordings. However, the labeled birdsong is typically highly probable to be audible at the beginning or the end of each audio file. The audio files were therefore trimmed to their first 5 and last 5 seconds and saved as NumPy arrays to speed up data loading, since the complete audio file does not need to be loaded. For training and cross-validation, the dataset was split into 5 stratified folds. For pre-training, we made sure that the species present in BirdCLEF 2024 were not present in our pre-training dataset in order to avoid leakage.

2.2. Feature extraction

We used the librosa [11] Python library to convert the 1D audio signals into 2D log-mel spectrograms. The spectrogram representation used the following parameters (a minimal extraction sketch is given after the list):

• Width (W) = 576 or 256
• Height (H) = 256 or 196
• Sampling rate (SR) = 32 000 Hz
• hop_length = 284
• mel_bins = W // 2
• frequency_min = 50 Hz
• frequency_max = 16 000 Hz
• power = 2.0
• top_db = 100
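To make the parameter list concrete, the following is a minimal sketch of such a log-mel extraction with librosa. The helper name extract_logmel, the default n_mels value, and the random placeholder audio are our own illustrative assumptions and not taken from the competition code; the resulting spectrogram is subsequently brought to the (W, H) sizes listed above.

```python
import librosa
import numpy as np

def extract_logmel(audio: np.ndarray,
                   sr: int = 32_000,
                   n_mels: int = 128,        # the parameter list sets mel_bins = W // 2
                   hop_length: int = 284,
                   fmin: float = 50.0,
                   fmax: float = 16_000.0,
                   top_db: float = 100.0) -> np.ndarray:
    """Convert a mono 1D waveform into a 2D log-mel spectrogram (illustrative helper)."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_mels=n_mels, hop_length=hop_length,
        fmin=fmin, fmax=fmax, power=2.0)
    return librosa.power_to_db(mel, top_db=top_db).astype(np.float32)

# Example: a 5-second chunk sampled at 32 kHz (random placeholder audio)
chunk = np.random.randn(5 * 32_000).astype(np.float32)
spec = extract_logmel(chunk)  # shape: (n_mels, number of frames)
```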
Figure 1: Input spectrogram: 16 × 16 spectrogram patches arranged in a 24 × 24 grid. The underlying spectrogram has a time axis (x-axis) ranging from 0 to 5 seconds and a frequency axis (y-axis) ranging from 50 to 16 000 Hz.

Algorithm 1: Resizing and rearranging spectrograms method for feature extraction.

Require: Input spectrogram x of shape either (576, 256) or (256, 196)
Ensure: Output tensor y of shape either (384, 384) or (224, 224)
 1: if x.shape = (576, 256) then
 2:     Initialize y ← zeros(384, 384, dtype = x.dtype)
 3:     x1 ← x^T.view(24, 24, 16, 16)
 4:     for i = 0 to 23 do
 5:         for j = 0 to 23 do
 6:             y[i × 16 : (i + 1) × 16, j × 16 : (j + 1) × 16] ← x1[i, j]
 7:         end for
 8:     end for
 9: else if x.shape = (256, 196) then
10:     Initialize y ← zeros(224, 224, dtype = x.dtype)
11:     x1 ← x^T.view(14, 14, 16, 16)
12:     for i = 0 to 13 do
13:         for j = 0 to 13 do
14:             y[i × 16 : (i + 1) × 16, j × 16 : (j + 1) × 16] ← x1[i, j]
15:         end for
16:     end for
17: else
18:     Raise error: Unsupported input shape
19: end if
20: return y

Vision transformers (ViT) often require spectrograms that are either 384 × 384 or 224 × 224 in size. Directly resizing the spectrograms from 576 × 256 to 384 × 384 or from 256 × 196 to 224 × 224 did not yield any positive results. Therefore, we utilized a resizing and rearranging method to ensure that the spectrograms are properly provided to the transformer model. The detailed algorithm for this approach is shown in Algorithm 1, and its spectrogram realization is illustrated in the accompanying Fig. 1; a PyTorch sketch of this rearrangement is given below.
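For clarity, a small PyTorch rendering of Algorithm 1 follows. It is our own sketch under the assumption that the second branch uses a 14 × 14 grid of 16 × 16 patches, the only grid consistent with a 224 × 224 output; the function name rearrange_spectrogram is illustrative.

```python
import torch

def rearrange_spectrogram(x: torch.Tensor) -> torch.Tensor:
    """Rearrange a 2D log-mel spectrogram into a square grid of 16 x 16 patches (cf. Algorithm 1)."""
    if x.shape == (576, 256):
        grid, out = 24, 384
    elif x.shape == (256, 196):
        grid, out = 14, 224
    else:
        raise ValueError(f"Unsupported input shape: {tuple(x.shape)}")
    # Split the transposed spectrogram into grid x grid patches of size 16 x 16
    patches = x.T.contiguous().view(grid, grid, 16, 16)
    y = torch.zeros(out, out, dtype=x.dtype)
    for i in range(grid):
        for j in range(grid):
            y[i * 16:(i + 1) * 16, j * 16:(j + 1) * 16] = patches[i, j]
    return y

# Example: a 576 x 256 spectrogram becomes a 384 x 384 image for ViT_tiny_384
square = rearrange_spectrogram(torch.randn(576, 256))
```

This way, each 16 × 16 tile of the resulting square image corresponds to one non-overlapping spectrogram patch rather than an interpolated mixture of time-frequency bins.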
Figure 2: Block diagram of the BirdCLEF transformer model architecture: input spectrogram → positional encoding → feature extraction (ViT / DeiT) → feature refinement (dropout, 1D CNN, batch norm, swish activation) → linear block → output.

2.3. Model Architecture

Figure 2 depicts the model architecture used. It incorporates positional encodings into the input spectrograms to enhance the spatial context of the data. Upon receiving the input spectrogram, an additional singleton dimension is added. A positional encoding is then generated by linearly interpolating values from 0 to 1 along the input height. This encoding is converted to half-precision and reshaped to match the spectrogram dimensions. The positional encoding is concatenated with the input spectrogram along the channel dimension, resulting in a combined input that includes both spectral and positional information. This enhanced input is then passed through the backbone network for feature extraction. The feature extractor used here is ViT / data-efficient image transformers (DeiT) [12, 13]. After feature extraction, the first element along the middle dimension is selected to reshape the feature map accordingly. The feature refinement stage processes these reshaped features through a series of operations, including a dropout layer, a 1D convolutional layer, batch normalization, and a swish activation function, to further refine the feature representation. The final classification logits are obtained by passing these refined features through the linear block, a fully connected layer. The output contains the logits, enabling the model to make improved predictions by leveraging spectral and spatial information.

2.4. Data Augmentation

Data augmentation techniques were utilized to address the domain shift between the training and test sets, as well as to handle weak and noisy labels. A concise overview of the augmentation strategies used is provided in [14]. These include:

• Noise augmentations
• Short-noise bursts
• General mix up
• tanh-based distortion
• Gaussian noise
• Loudness normalization

2.5. Training Methods

In the training process, we used various tools and libraries. The main framework we utilized was the machine learning library PyTorch, along with additional libraries such as the PyTorch Image Models (timm) for transformer backbone models [15], soundfile [16] and librosa [11] for audio and signal processing, audiomentations for data augmentation, and scikit-learn for calculating metrics and creating stratified folds for training / validation data splits.

The depth and number of heads of ViT and DeiT were reduced to decrease the required computation times. As a consequence of this reduction, pre-trained weights, such as those from the ImageNet dataset, could not be used. We therefore opted to train the model from scratch using the previous years' BirdCLEF competition datasets for 70 training epochs. These weights were in turn used as our pre-trained weights for BirdCLEF 2024.

ViT / DeiT require the image to be either 384 × 384 or 224 × 224 pixels in size, for which we extracted non-overlapping 16 × 16 patches arranged in a 24 × 24 grid. Initially, we resized our spectrograms to meet these dimensions. However, we encountered disappointing results, as essential features were diluted by noise domination and a misallocation of attention. Therefore, we resized and rearranged our spectrograms as described in Algorithm 1. After performing the spectrogram extraction and model training, the obtained results could still be improved, since transformers lack an inherent notion of the order of the input spectrograms. Hence, we introduced a positional encoding block that addresses this by injecting information about the positions of elements in the sequence directly into the model.

The model was trained with a learning rate of 5 · 10^−4 and a cosine annealing learning rate scheduler to optimize performance over epochs. To prevent overfitting, a weight decay of 1 · 10^−6 was applied. The training process utilized a weighted sampler based on the number of samples per class to ensure balanced representation during training. For loss computation, the binary cross-entropy with logits loss function was employed, incorporating secondary labels for more nuanced learning. The model utilized the AdamW optimizer. The number of training epochs was set to 70, with a batch size of 32. A minimal sketch of this training setup is given below.
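The following PyTorch sketch illustrates this training setup. The dataset, label tensors, and stand-in model are toy assumptions of ours rather than the actual BirdCLEF pipeline; only the hyperparameters (AdamW, learning rate 5 · 10^−4, weight decay 1 · 10^−6, cosine annealing, weighted sampling, binary cross-entropy with logits, 70 epochs, batch size 32) follow the text above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

NUM_CLASSES, EPOCHS, BATCH_SIZE = 182, 70, 32

# Toy stand-ins for the actual spectrogram dataset and transformer model (assumptions)
spectrograms = torch.randn(256, 1, 224, 224)
primary = torch.randint(0, NUM_CLASSES, (256,))            # primary label per file
targets = torch.zeros(256, NUM_CLASSES)
targets[torch.arange(256), primary] = 1.0                   # secondary labels would be set here as well
dataset = TensorDataset(spectrograms, targets)
model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, NUM_CLASSES))

# Weighted sampler: inverse frequency of the primary label per class
class_counts = torch.bincount(primary, minlength=NUM_CLASSES).clamp(min=1).float()
weights = (1.0 / class_counts)[primary]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = nn.BCEWithLogitsLoss()   # binary cross-entropy with logits over multi-label targets

for epoch in range(EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```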
3. Test results, evaluation, and discussion

While our submissions under the user name Arunodhayan using EfficientNetB0 ranked 76th and 440th on the public and private leaderboards, respectively, in this report we focus on the systematic studies obtained with vision transformers. Using vision transformers, we achieved a macro-averaged ROC-AUC of 63 % on the public test set (public leaderboard) and 62 % on the private test set (private leaderboard), excluding classes with no true positive labels. The top-ranked team achieved a score of 69 % on the private test set. Our best transformer model would have achieved rank 300 among 972 participants on the private leaderboard (LB) (team TUC in Table 2).

Table 2
BirdCLEF 2024 competition results. Note that rank 300 was not the best model submission and is thus not seen in the official ranking; rank 300 would be the position corresponding to the ROC-AUC of our best transformer-based approach. The submissions of TUC were made under the user name Arunodhayan.

Rank (private LB)  Team name          ROC-AUC (private LB)  ROC-AUC (public LB)
1                  Team Kefir         0.690391              0.738566
2                  adsr               0.690354              0.727939
3                  NVBird             0.689970              0.742124
4                  Team Cerberus      0.687770              0.746911
5                  coolz              0.687173              0.743960
300                TUC (Transformer)  0.629135              0.638547

The main goal of this competition was to evaluate the test set in under 2 hours of runtime without using a graphics processing unit (GPU). The primary challenge we faced was the inconsistency of Kaggle's hardware, which made it difficult to predict the runtime of our models. Therefore, we could not complete the test runs without altering our transformer models when the execution time exceeded 2 hours. To address this issue, we converted our model via OpenVINO [17]; a minimal conversion sketch is shown at the end of this section. After making the submission, the execution time was reduced to 117 minutes, and we achieved a score of 61 %. Subsequently, we experimented with reducing the depth and number of heads of the transformer models, which further reduced the inference time. The respectively obtained results are presented in Table 3.

Table 3
Experiment overview. Entries with "-" could not be completed due to runtime requirements.

ID    Description    Depth  Head  Testing time (minutes)  CV    Positional encoding  Private LB  Public LB
M1    ViT_tiny_384   12     3     314                     0.68  False                -           -
M1.1  ViT_tiny_384   12     3     314                     0.74  True                 -           -
M2    ViT_tiny_224   12     3     263                     0.69  True                 -           -
M3    ViT_tiny_384   10     3     219                     0.71  True                 -           -
M4    ViT_tiny_224   10     3     130                     0.66  True                 -           -
M5    ViT_tiny_384   8      3     119                     0.70  True                 0.595397    0.582498
M6    ViT_tiny_224   8      3     94                      0.68  True                 0.596327    0.581465
M7    ViT_tiny_384   8      2     72                      0.72  True                 0.596547    0.586987
M7.1  ViT_tiny_384   8      2     72                      0.72  True                 0.629135    0.638547
M8    ViT_tiny_224   8      2     55                      0.58  True                 0.592785    0.571654
M9    ViT_tiny_384   6      2     44                      0.63  True                 0.602487    0.598741
M10   DeiT_tiny_224  8      2     65                      0.69  True                 0.609154    0.597128
M11   DeiT_tiny_384  6      2     62                      0.72  True                 0.618745    0.602874

Ablation study. Table 3 outlines the various aspects and approaches related to model performance that we investigated. M1 and M1.1 are the baseline models, both supporting a resolution of 384 × 384. Unfortunately, we encountered issues in submitting these models, as the inference time exceeded the allotted time frame. Our cross-validation (CV) results, measured as macro-averaged mean average precision (cmAP), improved when positional encoding was introduced. We then attempted to utilize ViT at a resolution of 224 × 224. With the inclusion of positional encoding, we achieved a CV score of 69 %. However, our submission failed due to the inability to complete the execution within the allocated time frame. To address this, we experimented with reducing the complexity of the ViT transformer model, resulting in models M3 to M11. Model M5 was successfully submitted, although we experienced inconsistent Kaggle hardware, leading to unsuccessful runs with the same model. Subsequently, we opted to reduce the number of heads of the transformer to 2, which proved to be successful.

Model M5 served as our baseline, and we fine-tuned the model through data augmentation. Traditionally, adding soundscape noise as data augmentation has improved performance. However, this year's soundscape noise augmentation led to a drop in model performance. Consequently, we decided to incorporate noise augmentations from the ESC-50 dataset, focusing on insects, cars, and human noises, as well as non-bird events from the DCASE 2018 dataset. Following this principle, we generated pseudo labels for the unlabeled soundscape dataset using the M7 transformer model, leading to an overall performance increase of 4 % (model M7.1).
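As referenced above, the following is a minimal sketch of the OpenVINO conversion and CPU inference step, assuming a recent OpenVINO release (2023 or later) that exposes the PyTorch frontend via openvino.convert_model. The placeholder model, input shape, and file name are illustrative assumptions and do not reproduce the actual competition model.

```python
import numpy as np
import torch
from torch import nn
import openvino as ov

# Placeholder model standing in for the trained transformer (assumption)
model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 182)).eval()

# Convert the PyTorch model via OpenVINO's PyTorch frontend and save the IR to disk
example = torch.randn(1, 1, 224, 224)
ov_model = ov.convert_model(model, example_input=example)
ov.save_model(ov_model, "birdclef_transformer.xml")

# CPU-only inference on a spectrogram chunk, as required by the 2-hour runtime limit
core = ov.Core()
compiled = core.compile_model(ov_model, "CPU")
chunk = np.random.randn(1, 1, 224, 224).astype(np.float32)
logits = compiled(chunk)[compiled.output(0)]
probs = 1.0 / (1.0 + np.exp(-logits))   # per-class probabilities via sigmoid
```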
4. Conclusion and outlook

The BirdCLEF 2024 competition presented a unique challenge compared to its previous years. The use of the metric "a macro-averaged ROC-AUC that excludes classes with no true positive labels" made it difficult to compare cross-validation results and leaderboard scores. Furthermore, the importance of developing efficient models that achieve a good balance between accuracy and speed was highlighted. Our work explored the use of vision transformers and data-efficient image transformers as well as the optimization of these models to run on CPU hardware by converting them via OpenVINO.

Since transformers require standard shapes such as 384 × 384 or 224 × 224, directly resizing the spectrograms to the desired shape did not show the expected impact. As an alternative, we resized and reshaped the spectrograms by organizing 16 × 16 spectrogram patches into a 24 × 24 grid, which yielded promising results but did not outperform the scores of competing submissions that used ensemble models including EfficientNets. Due to the reduction in the complexity of the transformer, we could not use pre-trained weights. Therefore, we trained a model using data from previous competitions, facilitating faster model convergence.

The model's efficiency depends on the quantity and quality of the data. In BirdCLEF 2024, we trained a model using a dataset recorded with high-quality microphones. However, when we tested it using a soundscape dataset recorded with long-distance microphones, there was a domain shift, meaning the test set was not representative of the training dataset. To address this, we added non-bird event sounds to the training set as an augmentation, which made it more similar to the test set and improved model performance.

References

[1] S. Kahl, T. Denton, H. Klinck, V. Ramesh, V. Joshi, M. Srivathsa, A. Anand, C. Arvind, H. CP, S. Sawant, V. V. Robin, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats, Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum (2024).
[2] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, et al., Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[3] A. Sampathkumar, D. Kowerko, TUC Media Computing at BirdCLEF 2021: Noise augmentation strategies in bird sound classification in combination with DenseNets and ResNets, 2021. URL: https://arxiv.org/abs/2106.10856, arXiv preprint arXiv:2106.10856.
[4] A. Sampathkumar, D. Kowerko, TUC Media Computing at BirdCLEF 2022: Strategies in identifying bird sounds in complex acoustic environments, in: Proceedings of the Working Notes of CLEF, 2022, pp. 2189–2198.
[5] H. Klinck, M. Sohier, D. S. Kahl, T. Denton, V. Ramesh, BirdCLEF 2024, https://kaggle.com/competitions/birdclef-2024, 2024.
[6] A. Howard, A. Joly, H. Klinck, S. Dane, S. Kahl, T. Denton, BirdCLEF 2021 - Birdcall identification, 2021. URL: https://kaggle.com/competitions/birdclef-2021, Kaggle.
[7] A. Howard, A. Navine, H. Klinck, S. Dane, S. Kahl, T. Denton, BirdCLEF 2022, 2022. URL: https://kaggle.com/competitions/birdclef-2022, Kaggle.
[8] H. Klinck, S. Dane, S. Kahl, T. Denton, BirdCLEF 2023, 2023. URL: https://kaggle.com/competitions/birdclef-2023, Kaggle.
[9] D. Stowell, Y. Stylianou, M. Wood, H. Pamuła, H. Glotin, Automatic acoustic detection of birds through deep learning: The first bird audio detection challenge, Methods in Ecology and Evolution (2018). URL: https://arxiv.org/abs/1807.05812. arXiv:1807.05812.
[10] K. J. Piczak, ESC: Dataset for environmental sound classification, in: Proceedings of the 23rd ACM International Conference on Multimedia, ACM, 2015, pp. 1015–1018. doi:10.1145/2733373.2806390.
[11] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: Audio and music signal analysis in Python, in: Proceedings of the 14th Python in Science Conference, volume 8, 2015.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, CoRR abs/2010.11929 (2020). URL: https://arxiv.org/abs/2010.11929. arXiv:2010.11929.
[13] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers & distillation through attention, CoRR abs/2012.12877 (2020). URL: https://arxiv.org/abs/2012.12877. arXiv:2012.12877.
[14] A. S. Kumar, T. Schlosser, S. Kahl, D. Kowerko, Improving learning-based birdsong classification by utilizing combined audio augmentation strategies, Ecological Informatics (2024) 102699. URL: https://www.sciencedirect.com/science/article/pii/S1574954124002413. doi:10.1016/j.ecoinf.2024.102699.
[15] R. Wightman, PyTorch Image Models, https://github.com/rwightman/pytorch-image-models, 2019. doi:10.5281/zenodo.4414861.
[16] M. R. Gogins, soundfile, https://pypi.org/project/soundfile/, 2024. Version 0.12.1.
[17] OpenVINO Toolkit Contributors, OpenVINO™ toolkit, 2024. URL: https://github.com/openvinotoolkit/openvino, accessed: 2024-06-20.