<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TUC Media Computing at BirdCLEF 2024: Improving Birdsong Classification Through Single Learning Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arunodhayan Sampath Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Schlosser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danny Kowerko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Junior Professorship of Media Computing, Chemnitz University of Technology</institution>
          ,
          <addr-line>09107 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work presents our contribution to the BirdCLEF 2024 competition, aimed at enhancing birdsong classification through the implementation of single learning models. Our primary objective was to develop a robust model capable of processing continuous audio data to accurately identify bird species from their calls. Key challenges included addressing imbalanced training data, managing domain shifts between training and test samples, and processing extensive soundscape recordings within a limited time frame. Our proposed approach leverages transformer-based architectures, incorporating positional encodings to enhance the spatial context of the input spectrograms. Data augmentation techniques were employed to mitigate the effects of noisy labels and domain shifts. The model training process involved the use of various libraries and frameworks, with a focus on optimizing performance through strategies such as cosine annealing and weighted sampling. Our results indicate the potential to improve the accuracy and efficiency of birdsong classification in support of avian population monitoring and conservation efforts. Out of a total of 974 teams, whose ROC-AUC scores on the private leaderboard ranged between 45.8 % and 69.0 %, our best model achieved a ROC-AUC score of 62.9 %, ranking us at 300th place among all submissions. On the public leaderboard, our model achieved a score of 63.9 %.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio Classification</kwd>
        <kwd>Birdsong Soundscapes</kwd>
        <kwd>Computer Vision and Pattern Recognition</kwd>
        <kwd>Convolutional Neural Networks (CNN)</kwd>
        <kwd>Data-Efficient Image Transformers (DeiT)</kwd>
        <kwd>Vision Transformers (ViT)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and motivation</title>
      <p>
        The BirdCLEF 2024 [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] competition aimed to identify bird species in a global biodiversity hotspot
in the Western Ghats, also known as the Sahyadri. The broader goals of BirdCLEF 2024 included
developing a deep learning model capable of processing continuous data to recognize bird species by
their calls. Specifically, the objectives were to identify endemic bird species in the soundscape data of
the sky-islands, detect and classify endangered bird species (species of conservation concern) despite
having limited training data, and detect and classify nocturnal bird species, which are currently poorly
understood.
      </p>
      <p>This year’s primary challenges involved a significant imbalance in the training set, a shift in the
domain between the clean training samples and the soundscape test samples, and the time constraint of
testing 73.3 hours of diverse soundscape recordings within only 2 hours. Nevertheless, advancements
in deep learning models are expected to aid in monitoring avian populations, facilitate more effective
threat evaluation, and allow for timely adjustments to conservation actions. Ultimately, this will benefit
avian populations and support long-term sustainability efforts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Fundamentals and implementation</title>
      <p>
        The implementation of transformer-based bird species recognition presented in this work comprises
data preparation, feature extraction, model architecture, data augmentation, and training methods. This
work builds on our previous implementations for BirdCLEF 2021 and 2022, which utilized convolutional
neural networks (CNN) [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Data Preparation</title>
        <p>
          The BirdCLEF 2024 training set consisted of 24 459 audio recordings provided by xeno-canto [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], covering
182 bird species. The test dataset consisted of 1 100 recordings, each of 4 minutes in length. Table 1
provides an overview of the individual datasets utilized. Datasets with IDs 2, 3, and 4 were used for
model pre-training. Datasets with IDs 5 and 6 were employed as a data augmentation technique for
background noise. In total, 141 637 recordings across 1 000 classes were collected for training and
data augmentation. The BirdCLEF 2024 train set consisted of audio files sampled at 32 kHz using a
single channel (mono) and compressed using lossy Ogg compression. Similarly, the datasets used for
pre-training and augmentation were converted to 32 kHz and a single channel. The training dataset was
weakly labeled, meaning there was no precise information on the presence or absence of the labeled
birdsong within the recordings. However, the labeled birdsong is typically most likely to occur at the
beginning or the end of each audio file. These audio files were therefore trimmed to their first 5 and last
5 seconds and saved as NumPy arrays to speed up the data loading procedure, since the complete audio
file no longer needs to be loaded. For training and cross-validation, the dataset was split into 5 stratified
folds. For pre-training, we made sure that the species present in BirdCLEF 2024 were not present in our
pre-training dataset in order to avoid leakage.
        </p>
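        <p>The following minimal sketch (our own illustration, not necessarily the exact competition code) shows how such trimming, caching, and stratified splitting can be implemented in Python; the metadata file name and the primary_label column are assumptions.</p>
        <preformat>
# Sketch of the data preparation step: trim each weakly labeled recording to
# its first and last 5 seconds, cache the clips as NumPy arrays, and build
# 5 stratified folds. File and column names are assumptions for illustration.
import numpy as np
import pandas as pd
import librosa
from sklearn.model_selection import StratifiedKFold

SR = 32000          # sampling rate (mono)
CLIP_SECONDS = 5    # keep only the first and last 5 seconds

def cache_clips(ogg_path, out_path):
    audio, _ = librosa.load(ogg_path, sr=SR, mono=True)
    n = SR * CLIP_SECONDS
    head, tail = audio[:n], audio[-n:]
    # store only the trimmed clips so the complete file never has to be re-read
    np.save(out_path, {"head": head, "tail": tail}, allow_pickle=True)

train_df = pd.read_csv("train_metadata.csv")  # assumed metadata file
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(train_df, train_df["primary_label"])):
    train_df.loc[val_idx, "fold"] = fold
        </preformat>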
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Feature extraction</title>
        <p>
          We used the librosa [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] Python library to convert a 1D audio signal into 2D log-mel spectrograms. The
spectrogram representation used the following parameters (a minimal code sketch follows the list below):
• Width (W) = 576 or 256
• Height (H) = 256 or 196
• Sampling rate (SR) = 32 000 Hz
• hop_length = 284
• mel_bins = W // 2
• frequency_min = 50 Hz
• frequency_max = 16 000 Hz
• power = 2.0
• top_db = 100
        </p>
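        <p>A minimal librosa sketch of how these parameters map onto the log-mel spectrogram computation is given below; only the parameter wiring is taken from the list above, the function name is our own.</p>
        <preformat>
# Sketch of the log-mel spectrogram extraction with the parameters listed above
# (e.g. width 576, i.e. mel_bins = 576 // 2 = 288).
import numpy as np
import librosa

SR = 32000

def log_mel_spectrogram(audio, width=576):
    mel = librosa.feature.melspectrogram(
        y=audio,
        sr=SR,
        hop_length=284,
        n_mels=width // 2,
        fmin=50,
        fmax=16000,
        power=2.0,
    )
    # convert the power spectrogram to decibels, limited to a 100 dB range
    return librosa.power_to_db(mel, top_db=100).astype(np.float32)
        </preformat>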
        <p>Algorithm 1 Resizing and rearranging spectrograms for feature extraction.
Require: Input spectrogram S of shape either (576, 256) or (256, 196)
Ensure: Output tensor T of shape either (384, 384) or (224, 224)
1: if S.shape = (576, 256) then
2:   Initialize T ← zeros(384, 384, dtype = S.dtype)
3:   S1 ← S.view(24, 24, 16, 16)
4:   for i = 0 to 23 do
5:     for j = 0 to 23 do
6:       T[i × 16 : (i + 1) × 16, j × 16 : (j + 1) × 16] ← S1[i, j]
7:     end for
8:   end for
9: else if S.shape = (256, 196) then
10:   Initialize T ← zeros(224, 224, dtype = S.dtype)
11:   S1 ← S.view(14, 16, 16, 16)
12:   for i = 0 to 13 do
13:     for j = 0 to 15 do
14:       T[i × 16 : (i + 1) × 16, j × 16 : (j + 1) × 16] ← S1[i, j]
15:     end for
16:   end for
17: else
18:   Raise error: Unsupported input shape
19: end if
20: return T</p>
        <p>Vision transformers (ViT) typically require inputs that are either 384 × 384 or 224 × 224 in
size. Directly resizing spectrograms from 576 × 256 to 384 × 384 or from 256 × 196 to 224 × 224 did not yield
any positive results. Therefore, we utilized a resizing and rearranging method to ensure
that spectrograms are properly provided to the transformer model. The detailed algorithm for this
approach is shown in Algorithm 1. Its spectrogram realization is illustrated in the accompanying Fig. 1.</p>
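        <p>A compact NumPy realization of the first branch of Algorithm 1 might look as follows (our own sketch; the second branch is analogous). Note that 576 × 256 = 384 × 384 = 147 456, so the rearrangement preserves every spectrogram value.</p>
        <preformat>
# Sketch of Algorithm 1 for the (576, 256) case: view the spectrogram as a
# 24 x 24 grid of 16 x 16 patches and lay the patches out on a 384 x 384 image.
import numpy as np

def rearrange_576x256(spec):
    assert spec.shape == (576, 256)
    out = np.zeros((384, 384), dtype=spec.dtype)
    patches = spec.reshape(24, 24, 16, 16)   # 24 x 24 grid of 16 x 16 patches
    for i in range(24):
        for j in range(24):
            out[i * 16 : (i + 1) * 16, j * 16 : (j + 1) * 16] = patches[i, j]
    return out
        </preformat>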
        <sec id="sec-2-2-1">
          <title>Input</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Spectrogram</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>Dropout</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Feature</title>
        </sec>
        <sec id="sec-2-2-5">
          <title>Extraction</title>
          <p>(ViT / DeiT)</p>
        </sec>
        <sec id="sec-2-2-6">
          <title>Feature</title>
        </sec>
        <sec id="sec-2-2-7">
          <title>Refinement</title>
        </sec>
        <sec id="sec-2-2-8">
          <title>Linear Block</title>
        </sec>
        <sec id="sec-2-2-9">
          <title>Output</title>
          <p>1D CNN</p>
        </sec>
        <sec id="sec-2-2-10">
          <title>Batch Norm</title>
        </sec>
        <sec id="sec-2-2-11">
          <title>Swish Activation</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Model Architecture</title>
        <p>
          Figure 2 depicts the model architecture used. It incorporates positional encodings into the input
spectrograms to enhance the spatial context of the data. Upon receiving the input spectrogram,
an additional singleton dimension is added. A positional encoding is then generated by linearly
interpolating values from 0 to 1 along the input height. This encoding is converted to half precision
and reshaped to match the spectrogram dimensions. The positional encoding is concatenated with
the input spectrogram along the channel dimension, resulting in a combined input that includes both
spectral and positional information. This enhanced input is then passed through the backbone network
for feature extraction. The feature extractor used here is ViT / data-efficient image transformers (DeiT)
[
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. After feature extraction, the first element along the middle dimension is selected to reshape the
feature map accordingly. The feature refinement stage processes these reshaped features through a
series of operations, including a dropout layer, a 1D convolutional layer, batch normalization, and a
swish activation function, to further refine the feature representation. The final classification logits are
obtained by passing these refined features through the linear block, a fully connected layer. The output
contains the logits, enabling the model to make improved predictions by leveraging spectral and spatial
information.
        </p>
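        <p>A minimal PyTorch sketch of this architecture is given below. The backbone name, embedding dimension, dropout rate, and other hyperparameters are illustrative assumptions rather than the exact competition settings.</p>
        <preformat>
# Sketch: a positional encoding channel is concatenated to the spectrogram,
# a ViT / DeiT backbone (via timm) extracts features, and a refinement block
# (dropout, 1D CNN, batch normalization, swish / SiLU) precedes the linear head.
import torch
import torch.nn as nn
import timm

class BirdClassifier(nn.Module):
    def __init__(self, backbone="deit_base_patch16_384", n_classes=182, embed_dim=768):
        super().__init__()
        # the backbone receives 2 input channels: spectrogram + positional encoding
        self.backbone = timm.create_model(backbone, pretrained=False,
                                          in_chans=2, num_classes=0)
        self.refine = nn.Sequential(
            nn.Dropout(0.2),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=1),
            nn.BatchNorm1d(embed_dim),
            nn.SiLU(),   # swish activation
        )
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, spec):                       # spec: (B, H, W)
        spec = spec.unsqueeze(1)                   # add a singleton channel dimension
        b, _, h, w = spec.shape
        # positional encoding: linear interpolation from 0 to 1 along the height
        pos = torch.linspace(0, 1, h, device=spec.device, dtype=spec.dtype)
        pos = pos.view(1, 1, h, 1).expand(b, 1, h, w)
        x = torch.cat([spec, pos], dim=1)          # spectral + positional channels
        feat = self.backbone(x)                    # pooled features (B, embed_dim)
        feat = self.refine(feat.unsqueeze(-1)).squeeze(-1)
        return self.head(feat)                     # classification logits
        </preformat>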
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Data Augmentation</title>
        <p>Data augmentation techniques were utilized to address the domain shift between the training and test
sets, as well as to handle weak and noisy labels. A concise overview of the augmentation strategies we
used is provided in [14]; a minimal code sketch follows the list below. These include:
• Noise augmentations
• Short-noise bursts
• General mix up
• tanh-based distortion
• Gaussian noise
• Loudness normalization</p>
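        <p>A minimal sketch of such a pipeline with the audiomentations library is shown below; the class parameters and probabilities are assumptions, and general mix up is typically applied on batches inside the training loop rather than in this waveform-level pipeline.</p>
        <preformat>
# Sketch of a waveform-level augmentation pipeline with audiomentations;
# probabilities and source paths are assumptions for illustration.
from audiomentations import (Compose, AddBackgroundNoise, AddShortNoises,
                             AddGaussianNoise, TanhDistortion,
                             LoudnessNormalization)

augment = Compose([
    AddBackgroundNoise(sounds_path="background_noise/", p=0.5),  # noise augmentations
    AddShortNoises(sounds_path="background_noise/", p=0.3),      # short-noise bursts
    TanhDistortion(p=0.2),                                       # tanh-based distortion
    AddGaussianNoise(p=0.3),                                     # Gaussian noise
    LoudnessNormalization(p=1.0),                                # loudness normalization
])

# usage: augmented = augment(samples=audio, sample_rate=32000)
        </preformat>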
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Training Methods</title>
        <p>
          In the training process, we used various tools and libraries. The main framework we utilized was the
machine learning library PyTorch, along with additional libraries such as the PyTorch Image Models library timm
for transformer backbone models [15], soundfile [16] and librosa [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for audio and signal processing,
audiomentations for data augmentation, and scikit-learn for calculating metrics and creating stratified
folds for training / validation data splits.
        </p>
        <p>The depth and number of heads of ViT and DeiT were reduced to decrease the required computation
times. This reduction meant that standard pre-trained weights, such as those from the ImageNet
dataset, could not be used. As a result, we opted to train the model from scratch using the previous year’s BirdCLEF
competition dataset for 70 training epochs. These weights were in turn used as our pre-trained weights
for BirdCLEF 2024. ViT / DeiT require the image to be either 384 × 384 or 224 × 224 pixels in size, for
which we extracted non-overlapping 16 × 16 patches, arranged in a 24 × 24 grid. Initially, we resized
our spectrograms to meet these dimensions. However, we encountered disappointing results, as
essential features were diluted by noise domination and misallocated attention. Therefore, we
resized and rearranged our spectrograms as described in Algorithm 1. After performing the spectrogram
extraction and model training, the obtained results could still be improved, since transformers lack
an inherent notion of the order of the input spectrograms. Hence, we introduced a positional encoding
block that addresses this by injecting information about the positions of elements in the sequence
directly into the model. The model was trained with a learning rate of 5 · 10<sup>−4</sup> with a cosine annealing
learning rate scheduler to optimize performance over epochs. To prevent overfitting, a weight decay of
1 · 10<sup>−6</sup> was applied. The training process utilized a weighted sampler based on the number of samples
per class to ensure balanced representation during training. For loss computation, the binary cross-entropy
with logits loss function was employed, incorporating secondary labels for more nuanced learning.
The model utilized the AdamW optimizer. The number of epochs for training was set to 70, with a
batch size of 32.</p>
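        <p>The described training setup corresponds roughly to the following PyTorch sketch; the dataset and metadata objects (train_dataset, train_df) and the BirdClassifier model from the architecture sketch are assumptions used for illustration.</p>
        <preformat>
# Sketch of the training configuration: AdamW, cosine annealing, weight decay,
# class-balanced weighted sampling, and binary cross-entropy with logits.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

EPOCHS = 70
BATCH_SIZE = 32

model = BirdClassifier()  # see the architecture sketch above (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = torch.nn.BCEWithLogitsLoss()  # secondary labels enter as soft targets

# weighted sampler based on the number of samples per class
class_counts = train_df["primary_label"].value_counts()
sample_weights = 1.0 / train_df["primary_label"].map(class_counts).values
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))

loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=sampler)

for epoch in range(EPOCHS):
    for spec, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(spec), target)
        loss.backward()
        optimizer.step()
    scheduler.step()
        </preformat>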
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Test results, evaluation, and discussion</title>
      <p>While our submissions under the name Arunodhayan using EfficientNetB0 ranked 76th and 440th on the
public and private leaderboards, respectively, in this report we focus on the systematic
studies obtained with vision transformers. We achieved a macro-averaged ROC-AUC score of 63 % on the
public test set (public leaderboard) and 62 % on the private test set (private leaderboard) by utilizing
vision transformers, excluding classes with no true positive labels. The top-ranked team achieved a
score of 69 % on the private test set. Our best transformer model would have achieved a rank of 300
among 972 participants on the private leaderboard (LB) (team TUC in Table 2).</p>
      <p>The main goal of this competition was to evaluate the test set in under 2 hours of runtime without
using a graphics processing unit (GPU). The primary challenge we faced was the inconsistency in Kaggle’s
hardware, which made it difficult to predict the runtime of our models. As a result, we could not
complete the test runs without altering our transformer models whenever the execution time exceeded 2
hours. To address this issue, we converted our model via OpenVINO [17]. After making the submission,
the execution time was reduced to 117 minutes, and we achieved a score of 61 %. Subsequently, we
experimented with reducing the depth and number of heads of the transformer models, which further
reduced the inference time. The respectively obtained results are presented in Table 2.
</p>
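      <p>For CPU-only inference within the runtime limit, the conversion can be performed along the following lines (a sketch using the current OpenVINO Python API; the model object and example input shape are assumptions).</p>
      <preformat>
# Sketch of converting the trained PyTorch model to OpenVINO for CPU inference.
import torch
import openvino as ov

model = BirdClassifier().eval()                  # trained model (assumption)
example = torch.randn(1, 384, 384)               # one rearranged spectrogram
ov_model = ov.convert_model(model, example_input=example)
ov.save_model(ov_model, "bird_model.xml")        # IR files for the Kaggle kernel

core = ov.Core()
compiled = core.compile_model(ov_model, "CPU")
logits = compiled(example.numpy())[0]            # CPU inference
      </preformat>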
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and outlook</title>
      <p>The BirdCLEF 2024 competition presented a unique challenge compared to previous years. The use
of the metric “a macro-averaged ROC-AUC that excludes classes with no true positive labels” made it
difficult to compare cross-validation results and leaderboard scores. Furthermore, the importance of
developing efficient models that achieve a good balance between accuracy and speed was highlighted.
Our work explored the use of vision transformers and data-efficient image transformers as well as
the optimization of these models to run on CPU hardware by converting them via OpenVINO. Since
transformers require standard shapes such as 384 × 384 or 224 × 224, directly resizing the spectrograms
to the desired shape did not show the expected impact. As an alternative, we resized and
reshaped the spectrograms by organizing 16 × 16 spectrogram patches into a 24 × 24 grid, which yielded
promising results but did not outperform the scores of competing submissions that used ensemble
models including EfficientNets. Due to the reduction in the complexity of the transformer, we could
not use pre-trained weights. Therefore, we trained a model using data from a previous competition,
facilitating faster model convergence. The model’s efficiency depends on the data’s quantity and quality.
In BirdCLEF 2024, we trained a model using a dataset recorded with high-quality microphones. However,
when we tested it using a soundscape dataset recorded with long-distance microphones, there was a
domain shift, meaning the test set was not representative of the training dataset. To address this, we
added non-bird event sounds to the training set as an augmentation, which made it more similar to the
test set and improved model performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF 2024:
          <article-title>Acoustic identification of under-studied bird species in the western ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <source>Overview of lifeclef</source>
          <year>2024</year>
          :
          <article-title>Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the CrossLanguage Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sampathkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kowerko</surname>
          </string-name>
          , Tuc media computing at birdclef 2021:
          <article-title>Noise augmentation strategies in bird sound classification in combination with densenets and resnets, 2021</article-title>
          . URL: https://arxiv.org/abs/2106.10856, arXiv preprint arXiv:
          <volume>2106</volume>
          .
          <fpage>10856</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sampathkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kowerko</surname>
          </string-name>
          , Tuc media computing at birdclef 2022:
          <article-title>Strategies in identifying bird sounds in complex acoustic environments</article-title>
          ,
          <source>in: Proceedings of the Working Notes of CLEF</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2189</fpage>
          -
          <lpage>2198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sohier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <year>Birdclef 2024</year>
          , https://kaggle.com/ competitions/birdclef-2024,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton, Birdclef 2021 - birdcall identification,
          <year>2021</year>
          . URL: https://kaggle.com/competitions/birdclef-2021, kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Navine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton,
          <year>Birdclef 2022</year>
          ,
          <year>2022</year>
          . URL: https: //kaggle.com/competitions/birdclef-2022, kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton,
          <year>Birdclef 2023</year>
          ,
          <year>2023</year>
          . URL: https://kaggle.com/competitions/ birdclef-2023, kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stowell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Stylianou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pamuła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <article-title>Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge, Methods in Ecology and Evolution (</article-title>
          <year>2018</year>
          ). URL: https://arxiv.org/abs/
          <year>1807</year>
          .05812. arXiv:
          <year>1807</year>
          .05812.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Piczak</surname>
          </string-name>
          ,
          <article-title>ESC: Dataset for environmental sound classification</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM International Conference on Multimedia, ACM</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1018</lpage>
          . doi:10.1145/2733373.2806390.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McVicar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Battenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <article-title>librosa: Audio and music signal analysis in python</article-title>
          ,
          <source>in: Proceedings of the 14th python in science conference</source>
          , volume
          <volume>8</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , CoRR abs/
          <year>2010</year>
          .11929 (
          <year>2020</year>
          ). URL: https://arxiv.org/ abs/
          <year>2010</year>
          .11929. arXiv:
          <year>2010</year>
          .11929.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          , CoRR abs/
          <year>2012</year>
          .12877 (
          <year>2020</year>
          ). URL: https: //arxiv.org/abs/
          <year>2012</year>
          .12877. arXiv:
          <year>2012</year>
          .12877.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] A. S. Kumar, T. Schlosser, S. Kahl, D. Kowerko, Improving learning-based birdsong classification by utilizing combined audio augmentation strategies, Ecological Informatics (2024) 102699. URL: https://www.sciencedirect.com/science/article/pii/S1574954124002413. doi:10.1016/j.ecoinf.2024.102699.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] R. Wightman, PyTorch image models, https://github.com/rwightman/pytorch-image-models, 2019. doi:10.5281/zenodo.4414861.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] M. R. Gogins, Soundfile, https://pypi.org/project/soundfile/, 2024. Version 0.12.1.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] OpenVINO Toolkit Contributors, OpenVINO™ toolkit, 2024. URL: https://github.com/openvinotoolkit/openvino, accessed: 2024-06-20.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>