<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning with Pseudo Multi-Label Birdcall Classification for DS@GT BirdCLEF 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Cheung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Murilo Gustineli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashley Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present working notes for the DS@GT team on transfer learning with pseudo multi-label birdcall classification for the BirdCLEF 2024 competition, focused on identifying Indian bird species in recorded soundscapes. Our approach utilizes production-grade models such as the Google Bird Vocalization Classifier, BirdNET, and EnCodec to address representation and labeling challenges in the competition. We explore the distributional shift between the training data and this year's unlabeled soundscapes, which are representative of the hidden test set, and propose a pseudo multi-label classification strategy to leverage the unlabeled data. Our highest post-competition public leaderboard score is 0.63 using BirdNET embeddings with Bird Vocalization pseudo-labels. Our code is available at github.com/dsgt-kaggle-clef/birdclef-2024.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer Learning</kwd>
        <kwd>Dataset Annotation</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Association Rule Mining</kwd>
        <kwd>Google Bird Vocalization Classifier</kwd>
        <kwd>BirdNET</kwd>
        <kwd>EnCodec</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Birdcall Classification Overview</title>
      <p>
        Birdcall classification is a challenging task due to the variability in bird vocalizations, the presence of
background noise, and the large number of species to classify. In addition, the measured data comes in
the form of audio recordings, which are high-dimensional and require specialized processing techniques.
Many successful approaches to birdcall classification utilize convolutional neural networks (CNNs) to
extract features from audio spectrograms. Audio spectrograms are a time-frequency representation
of the audio signal, which are extracted using the short-time Fourier transform (STFT) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], often with
additional preprocessing steps such as mel-frequency scaling. The spectrograms are represented as 2D
images, allowing models to draw on the rich literature surrounding image classification.
      </p>
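      <p>As a concrete illustration of the pipeline above, the following is a minimal NumPy sketch of a log-mel spectrogram computed via the STFT. The parameter values (FFT size, hop length, number of mel bands) are illustrative choices, not the settings used by any of the competition models.</p>
      <preformat>
```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude of the short-time Fourier transform over Hann windows."""
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(window * signal[start:start + n_fft]))
        for start in range(0, len(signal) - n_fft + 1, hop)
    ]
    return np.array(frames).T  # shape: (n_fft // 2 + 1, n_frames)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

# A 5-second clip at 32 kHz (the competition's interval length) of noise.
sr = 32000
clip = np.random.default_rng(0).normal(size=5 * sr)
spec = stft_magnitude(clip)
mel_spec = np.log1p(mel_filterbank(64, 512, sr) @ spec)
```
      </preformat>
      <p>The resulting 2D array is what CNN-based classifiers consume as an image.</p>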
      <p>
        BirdNET is a popular birdcall classification model that utilizes the spectrogram-CNN approach. It
is widely distributed in the field due to its high accuracy and ease of use on mobile devices [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
Google Bird Vocalization Classifier is another model using EfficientNet-B1, a similar CNN architecture,
and is trained on many soundscapes. It was released alongside the BirdCLEF 2023 competition and has
more than 10,000 species in its output space [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Domain Knowledge Transfer via Embedding Spaces</title>
      <p>
        Neural networks are universal function approximators that map some input space to an arbitrary output
space [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A neural network’s intermediate layers can be considered a manifold that typically projects
high dimensional data to lower dimensional spaces. Transfer learning is a technique that leverages the
learned representations of a model trained on one task to improve the performance of a model on a
different task. In the context of birdcall classification, raw audio data is projected onto a manifold
optimized to discriminate between bird species. The projections are called embeddings and can be used
as features in downstream tasks to transfer knowledge to a new but similar domain. Few-shot learning
on global bird embeddings tends to be effective in new domains, as shown by Ghani et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        We visually inspect learned embeddings by projecting them into two or three dimensions to reveal
meaningful semantic structures that can be qualitatively interpreted by humans. We use PaCMAP, a
dimensionality reduction technique that preserves local and global structure on a manifold via pairwise
relationships [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], to visualize embeddings from the Bird Vocalization Classifier, BirdNET, and EnCodec.
In Figure 1, we project the top five species by frequency in the soundscape dataset. The first two are
domain-specific models trained on bird vocalizations, while the third is a general-purpose neural audio
codec. EnCodec is particularly interesting because it is not trained on bird vocalizations but rather a
diverse set of audio data using a self-supervised learning objective guided by self-attention mechanisms
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It is also perceptually optimized for audio compression, preserving meaningful structures in the
embedding space, such as bird vocalizations over static background noise.
      </p>
      <p>We explore the effectiveness of transfer learning, where Bird Vocalization predictions serve as surrogates for
true labels. We hypothesize that transfer learning is an effective technique for the competition because
existing models capture underlying structures amenable to optimization by simple linear classifiers. We
quantify how well we can learn domain-specific adaptations between different embedding spaces and
how well each of these models is suited to capture the underlying structure of the data.</p>
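      <p>The transfer learning setup described above reduces to a linear probe: a small classifier trained on frozen embeddings. The following is a minimal NumPy sketch of the idea on synthetic data standing in for real embeddings; the dimensions, learning rate, and cluster structure are hypothetical illustrations, not our pipeline's settings.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for frozen embeddings: two species clusters in R^16.
# In our pipeline these would be 1280-d Bird Vocalization embeddings.
n, dim, n_species = 200, 16, 2
centers = rng.normal(size=(n_species, dim))
labels = rng.integers(0, n_species, size=n)
X = centers[labels] + 0.1 * rng.normal(size=(n, dim))
Y = np.eye(n_species)[labels]  # one-hot multi-label targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A linear classifier head trained with binary cross-entropy via gradient descent.
W = np.zeros((dim, n_species))
b = np.zeros(n_species)
for _ in range(500):
    P = sigmoid(X @ W + b)
    grad = X.T @ (P - Y) / n  # BCE gradient with respect to the logits
    W -= 0.5 * grad
    b -= 0.5 * (P - Y).mean(axis=0)

accuracy = (sigmoid(X @ W + b).argmax(axis=1) == labels).mean()
```
      </preformat>
      <p>Because the embedding space already separates the classes, the linear head recovers the labels with only a few hundred gradient steps.</p>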
    </sec>
    <sec id="sec-4">
      <title>4. Exploratory Data Analysis</title>
      <p>
        We perform an exploratory analysis on the training and unlabeled soundscape datasets to understand
species distribution and their co-occurrence patterns. We hypothesize a shift in species distribution
between the training and unlabeled soundscape datasets due to differences in the recording
environments, a shift we observe downstream in domain knowledge transfer. The training dataset contains 182
species obtained from crowd-sourced Xeno-Canto recordings. Because they are crowd-sourced, the
training data is likely biased toward clear and distinct vocalizations that typically occur in isolation. The
soundscape dataset is a collection of 4-minute soundscapes from Western Ghats, India, representative
of the hidden test set in the competition [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The soundscapes are likely to be more intermittent and to
contain less distinct, overlapping vocalizations due to the lack of human-directed attention
in the recording process.
      </p>
      <p>We use the Bird Vocalization model to extract the embeddings and logits from the datasets in
5-second intervals. The training dataset comprises 217k discrete intervals totaling 302 hours, while the
soundscape dataset has 407k intervals totaling 566 hours. We assume that an interval contains a call
if the maximum logit value run through a sigmoid function exceeds a threshold of 0.5. The training
dataset has a higher density of calls, with 62% of intervals containing at least one call, compared to
8.8% in the soundscape dataset.</p>
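      <p>The call-detection heuristic above can be stated in a few lines. The logit values below are hypothetical; the threshold of 0.5 follows the text.</p>
      <preformat>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contains_call(logits, threshold=0.5):
    """An interval contains a call if the sigmoid of its maximum logit
    exceeds the threshold, following the heuristic described above."""
    return sigmoid(np.max(logits)) > threshold

# Hypothetical logit vectors for two 5-second intervals over 4 species.
quiet_interval = np.array([-6.0, -4.5, -8.0, -5.2])  # all strongly negative
vocal_interval = np.array([-6.0, 2.3, -8.0, -5.2])   # one confident species
```
      </preformat>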
      <p>We compute the relative frequency of species over all discrete intervals and compare their distributions
by the ranked frequency in the soundscape dataset in Figure 2. There is a notable discordance between
the two distributions, with many species in the training set unrepresented in the unlabeled soundscapes.
In Table 1, we observe that the top species in each dataset do not align when ranked by raw frequency of occurrence.</p>
      <p>
        We use a frequent-pattern mining algorithm, FPGrowth [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], to identify co-occurrence patterns in
the soundscape dataset. In Table 2, we observe that co-occurrences of species can appear more often
than individual species alone. The frequent itemsets give us a rough estimate of how many birds we
can expect to see in a single recording. In Figure 3, we plot the distribution of normalized itemset sizes
in the training and soundscape datasets. We observe an approximately normal distribution of sizes
centered around four to six species per recording. The training set is skewed toward smaller itemsets,
likely due to biases in the data collection process where individuals are more likely to record and upload
isolated calls.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <p>The experiments are run over several modular stages. We implement an end-to-end workflow that
applies domain-specific fine-tuning to state-of-the-art models for birdcall classification. We quantify
differences between choices of dataset, architecture, and training losses. The second part of the
experiment focuses on transfer learning using the model as a surrogate, where the model’s predictions
are used as labels for transfer learning on audio classification models. In particular, we study a widely
distributed birdcall-specific convolutional neural network and a self-supervised neural audio codec for
encoding and decoding.</p>
      <sec id="sec-5-1">
        <title>5.1. Transfer Learning</title>
        <p>The Google Bird Vocalization Classification model is the main surrogate transfer learning experiment,
focused on version 4 of google/bird-vocalization-classifier on Kaggle, corresponding to
version 1.3 on the TensorFlow hub. We directly compute a prediction for the competition by selecting
the competition subset, filling in the missing species with negative infinity, and computing the sigmoid
of the logits. We perform fine-tuning of the model by training a new classification head on the training
dataset using the thresholded predictions of the model as pseudo-labels for the multi-label classification
task. We take advantage of the species label of the folder according to Section 5.3 as one form of
augmentation. We also fine-tune the model on the unlabeled soundscape data.</p>
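        <p>The direct-prediction step described above, selecting the competition subset and filling missing species with negative infinity before applying the sigmoid, can be sketched as follows. The species names and logit values are hypothetical placeholders.</p>
        <preformat>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model output space and competition species list.
model_species = ["spec1", "spec2", "spec3"]
competition_species = ["spec2", "spec4"]  # "spec4" is missing from the model

logits = np.array([0.2, 1.5, -0.3])  # logits for one 5-second interval

# Select the competition subset, filling missing species with -inf so that
# their sigmoid probability is exactly zero.
index = {s: i for i, s in enumerate(model_species)}
subset = np.array([
    logits[index[s]] if s in index else -np.inf
    for s in competition_species
])
probs = sigmoid(subset)
```
        </preformat>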
        <p>We experiment with different losses to optimize the multi-label classifier, including binary
cross-entropy (BCE), asymmetric loss (ASL), and sigmoidF1, which are explained in Section 5.4. We
also experiment with a hidden layer to increase the capacity of the models. The model is trained for 20 epochs
with a batch size of 1,000 and a learning rate calculated by PyTorch Lightning. The training dataset
is split with an 80-20 train-validation split. The model is trained on a single NVIDIA L4 on a Google
Cloud Platform (GCP) g2-standard-8 instance with 8 vCPU, 16GB of memory, and 375GB of local NVME
storage for dataset caching and model checkpoints.</p>
        <p>We use BirdNET V2.4 through joeweiss/birdnetlib and EnCodec through
facebookresearch/encodec v0.1.1 as comparisons for knowledge transfer. Though both
the Bird Vocalization model and BirdNET provide predictions for classification, the former provides a
more extensive set of species that overlaps with this year’s competition. We ignore the outputs of the
BirdNET model for our experiments and focus on learning the distribution of the Bird Vocalization
model’s outputs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data Preprocessing</title>
        <p>We pre-compute the embeddings and predictions of the Bird Vocalization model on the training and
unlabeled soundscape datasets into a binary, columnar format that is easily accessible from network
storage. The embeddings lie in an ℛ^1280 space, while predictions are limited to the competition’s species
set. If the species is not present, its prediction is set to zero by assigning negative infinity to the logit
output. To save computation, we also pre-compute and join the embeddings from BirdNET and EnCodec
with the predictions of the Bird Vocalization model.</p>
        <p>For BirdNET, we must align the model’s input size of 3 seconds at 48 kHz to the 5 seconds at 32 kHz
that both the Bird Vocalization model and the BirdCLEF competition expect. We take the mean of the
embeddings of two 3-second windows starting at the 0th and 2nd seconds of each 5-second clip. This provides
coverage of the entire audio clip while limiting the computational burden of encoding. We take 5-second
embedding tokens at 24 kHz and limit the bandwidth of EnCodec to 1.5 kbps for an embedding space of
ℛ^(5×150). Increasing the bandwidth to 3 kbps leads to an embedding space of ℛ^(5×300). We qualitatively
inspect the embeddings through a cluster analysis in Figure 1, noting the relative difficulty of separating
common classes within the dataset.</p>
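        <p>The BirdNET window-alignment scheme can be sketched as follows, with a toy statistic standing in for the real model’s embedding function; an actual BirdNET embedding is a much higher-dimensional vector, and the clip here is already assumed resampled to 48 kHz.</p>
        <preformat>
```python
import numpy as np

def birdnet_aligned_embedding(clip, sr=48000, window_sec=3, starts=(0, 2)):
    """Average 3-second window embeddings over a 5-second clip.
    Windows starting at seconds 0 and 2 cover [0, 3) and [2, 5), spanning
    the whole clip with minimal overlap."""
    def embed(window):
        # Hypothetical embedding function: a real model returns a long
        # feature vector; here we summarize the window with two statistics.
        return np.array([window.mean(), window.std()])
    windows = [
        clip[int(s * sr):int((s + window_sec) * sr)] for s in starts
    ]
    return np.mean([embed(w) for w in windows], axis=0)

sr = 48000
clip = np.random.default_rng(1).normal(size=5 * sr)  # synthetic 5 s clip
emb = birdnet_aligned_embedding(clip, sr=sr)
```
        </preformat>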
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Pseudo Multi-Label Construction</title>
        <p>The training dataset lacks traditional labels for supervised learning, as the 5-second intervals in each
recording are not labeled with the species present. We use pseudo-labels derived from the thresholded
predictions of a surrogate model, which are not human-verified ground truth. Additionally, we use the
folder species as an extra label for further training the model. The thresholded predictions are defined
as a function of the model output g(x) and a threshold t, with the sigmoid function denoted by σ.</p>
        <p>ŷ = σ(g(x)) &gt; t (1)</p>
        <p>We define an indicator variable 1_call that determines whether the model output detects a birdcall,
which occurs when any species prediction is positive.</p>
        <p>1_call(ŷ) = ( ∑_{i=0}^{|ŷ|} ŷ_i ) &gt; 0 (2)</p>
        <p>We also generate a one-hot encoding 1_species of the folder to which the current audio belongs,
where S is the set of species in the folder.</p>
        <p>1_species(s) = 1 when s ∈ S, 0 when s ∉ S (3)</p>
        <p>
          We experiment with different losses to optimize the multi-label classifier. The competition evaluation
uses a modified ROC-AUC that skips classes with no true-positive labels. We utilize MultilabelAUROC
from the torchmetrics library as the primary learning metric. We also consider the macro-F1 score
as a secondary metric, which was utilized in the 2022 edition of the BirdCLEF competition [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This
metric allows us to inspect other aspects of the loss functions we consider in our experiments.
        </p>
        <p>Finally, we define our modified label by augmenting the model’s thresholded output with the species
of the folder whenever a call is detected. This can be implemented as a vectorized operation in PyTorch.</p>
        <p>ŷ_species = ŷ ∨ (1_call(ŷ) ∧ 1_species(s)) (4)</p>
        <p>We use a threshold of t = 0.5 when defining all labels. Experiments on the unlabeled
soundscapes do not have the additional information provided in the training dataset, and thus we are
limited to the pseudo-labels ŷ.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Training Losses</title>
        <sec id="sec-5-4-1">
          <title>5.4.1. Binary Cross-Entropy</title>
          <p>Binary cross-entropy (BCE) is a loss function for binary classification. It is suitable for multi-label
classification because it treats each label as an independent binary classification problem. We use this
loss as a baseline due to its simple interpretation and absence of hyperparameters.</p>
          <p>ℒ_BCE = − ∑_{i=1}^{C} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ] (5)</p>
        </sec>
        <sec id="sec-5-4-2">
          <title>5.4.2. Asymmetric Loss (ASL)</title>
          <p>
            The asymmetric loss [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] penalizes false positives and false negatives differently. This construction
dynamically down-weights easy negative samples, hard-thresholds them, and ignores misclassified
samples. This loss is well-suited for our problem domain since we have fuzzy labels from another model
initially intended for single-label classification.
          </p>
          <p>ℒ_ASL = { ℒ+ = (1 − p)^γ+ log(p) ; ℒ− = p^γ− log(1 − p) } (6)</p>
          <p>The loss is defined in terms of the probability of the network output p and hyper-parameters γ+
and γ−, where ℒ+ applies to positive labels and ℒ− to negative labels. Setting γ− &gt; γ+ emphasizes
positive examples, while setting both terms to 0 yields binary cross-entropy. We sweep over parameters
γ+ ∈ {0, 1} and γ− ∈ {0, 2, 4}, while the default values are γ+ = 1 and γ− = 4.</p>
        </sec>
        <sec id="sec-5-4-3">
          <title>5.4.3. sigmoidF1</title>
          <p>
            The sigmoidF1 loss [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] optimizes the F1 score directly by creating a differentiable approximation of
the F1 score. Though the competition does not score with F1, it provides a useful point of comparison
with other losses. We first define a parameterized sigmoid S with hyper-parameters β and η.
          </p>
          <p>S(u; β, η) = 1 / (1 + exp(−β(u + η))) (7)</p>
          <p>We then define soft true positive, false positive, false negative, and true negative counts as a
function of this sigmoid applied to the model output ŷ, where ⊙ denotes element-wise multiplication.</p>
          <p>t̃p = ∑ S(ŷ) ⊙ y,  f̃p = ∑ S(ŷ) ⊙ (1 − y),  f̃n = ∑ (1 − S(ŷ)) ⊙ y,  t̃n = ∑ (1 − S(ŷ)) ⊙ (1 − y) (8)</p>
          <p>Finally, we define a soft F1 score from the true positive, false positive, and false negative counts and
minimize its complement.</p>
          <p>ℒ_F̃1 = 1 − F̃1, where F̃1 = 2t̃p / (2t̃p + f̃n + f̃p) (9)</p>
          <p>We sweep over the hyper-parameters β ∈ {−1, −15, −30} and η ∈ {0, 1}, as suggested by the
authors’ experiments.</p>
        </sec>
      </sec>
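      <p>The sigmoidF1 construction can be sketched as follows in NumPy; the training implementation operates on PyTorch tensors, and the slope parameter here uses a positive, increasing-sigmoid convention for readability rather than the swept values reported above. The logits and targets are hypothetical.</p>
      <preformat>
```python
import numpy as np

def sigmoid_f1_loss(logits, targets, beta=5.0, eta=0.0):
    """Differentiable F1 surrogate: replace hard prediction counts with
    a parameterized sigmoid S(u; beta, eta) = 1 / (1 + exp(-beta * (u + eta)))."""
    s = 1.0 / (1.0 + np.exp(-beta * (logits + eta)))
    tp = np.sum(s * targets)          # soft true positives
    fp = np.sum(s * (1.0 - targets))  # soft false positives
    fn = np.sum((1.0 - s) * targets)  # soft false negatives
    soft_f1 = 2.0 * tp / (2.0 * tp + fn + fp)
    return 1.0 - soft_f1

# Hypothetical logits and multi-label targets: two intervals, three species.
logits = np.array([[4.0, -4.0, -4.0], [-4.0, 4.0, 4.0]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])

good_loss = sigmoid_f1_loss(logits, targets)   # predictions match targets
bad_loss = sigmoid_f1_loss(-logits, targets)   # predictions inverted
```
      </preformat>
      <p>Well-aligned logits drive the loss toward 0, while inverted logits drive it toward 1, matching the behavior of the hard F1 score.</p>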
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>We obtain results for various models on the leaderboard via code submission on Kaggle. We report
the best validation F1 and AUROC scores, together with private and public leaderboard scores. All
submissions were made past the competition deadline with the exception of the starter Keras notebook.
We submit a model that predicts 0 for every species on the leaderboard, leading to a private and public
score of 0.5. We submit the predictions from the Bird Vocalization model and obtain private and public
scores of 0.516625 and 0.556097, respectively.</p>
      <sec id="sec-6-1">
        <title>6.1. Loss Comparisons</title>
        <p>In Table 3, we train a linear classifier head against combinations of BCE and ASL with the addition
of the species label logic. We report our validation F1 and AUROC scores alongside the private and
public scores. We note that AUROC quickly saturates against the validation set used in the training
dataset. The validation F1 score, however, correlates more strongly with the leaderboard scores. Using
the species labels typically increases the score by 0.05; e.g., ASL with default parameters goes from 0.529
to 0.576 on the public leaderboard.</p>
        <p>We experimented with adding a hidden layer behind the classification head to encourage the model
to learn more complex patterns. Using ASL as the loss function, we varied the hyperparameters listed
in Table 4. We confirmed the efficacy of the species logic but noted that the scores were marginally
lower than those of the linear models. Additionally, we found that the default parameters of ASL are
effective in most tasks, with minimal tuning needed for good performance on domain-specific tasks.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Embedding Model Comparisons</title>
        <p>We summarize the performance of each loss function across the CNN-based models in Table 5. Due
to CPU-time limitations on notebook runtime, we do not include an EnCodec-based model. Our best
model on the public leaderboard uses BirdNET embeddings and the BCE loss. BirdNET embeddings
consistently perform better with linear models, despite the origin of the labels being the Bird Vocalization
model. Access to the species label from the parent folder consistently improves scores. While BCE
performs well, this behavior is not indicated by our validation and private test metrics alone.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Dataset Comparisons</title>
        <p>In Table 6, we compare the performance of linear models trained on the soundscape dataset using
ASL as the main loss. We observe two main results: (1) BirdNET embeddings outperform the bird
vocalization model by 0.03 on the public leaderboard, and (2) models trained on the soundscape dataset
are less effective than those trained on the distribution of the training dataset. This may be attributed
to ASL’s dynamic downscaling of easily classified negative labels, making the contribution of training
labels more significant than the similarity to the test distribution.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Inference Runtime</title>
        <p>We profile each model to estimate the time required to process all test soundscapes, as shown in Table 7.
The Python profiler measures the time spent in each function and the number of function calls. Reading
all audio into chunked arrays from disk into memory, our baseline takes approximately one minute.</p>
        <p>The Bird Vocalization model did not complete within Kaggle’s time constraints, taking nearly three
hours according to our estimates. We compile the model using TensorFlow Lite at runtime, optimizing
operations for the hardware while allowing fallback to non-lite operations. This compilation process
results in an order-of-magnitude performance increase, leaving a substantial margin for additional
computation. The linear classification head adds only an extra half-hour of computation. The BirdNET
model also runs well within time constraints as it is compiled with TensorFlow Lite.</p>
        <p>
          EnCodec exceeds the time budget, taking 2.4 hours for the base model. Experimenting with OpenVINO
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and applying data-independent quantization and compression did not improve inference speed.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Transfer Learning Experimentation</title>
        <p>Our transfer learning experiments using the Bird Vocalization classifier exhibit different behaviors
between the private and public leaderboards. While fine-tuned models outperform the base model when
trained on the subset of species provided for the competition, we hypothesize a shift in the species
distribution between the private and public test sets. The Bird Vocalization model is trained on a more
balanced dataset drawn from a larger set of species, whereas our transfer learning techniques rely on
pseudo-labeling from the donor model, which may not be well-calibrated for this task. We did not
account for the skew in the training data, apparent from the distribution of audio of each species.</p>
        <p>We address label skew through different loss function choices. We use a secondary metric during
training to provide another axis to compare models. When fine-tuning the Bird Vocalization classifier
to learn the outputs from the original classifier head, the AUROC metric converges close to unity across
various architectures. However, different losses exhibit varying learning behaviors against the
F1 score, with some designed to be better surrogates than binary cross-entropy loss. During transfer
learning, these losses provide a smooth, monotonic increase in the validation F1 score, indicating that
Bird Vocalization embeddings offer a "good" representation of domain-specific data for the multi-label
problem. We observe different behaviors in other embedding spaces, supported by our clustering charts.</p>
        <p>To address skew in the training dataset, the organizers provide unlabeled soundscapes representative
of the hidden test set. We discuss the distributional shift between species and frequent itemsets in
Section 4. Figure 5 shows the active intervals of calls, revealing differences in data geometry. The training
dataset at the bottom has tightly clustered logits, likely representing peaks in species probability
distributions. The embeddings form a large central cluster with several outliers, probably representing
distinctive calls. Conversely, the soundscape logit space forms two major clusters, reflecting the smaller
set of species present. Thus, soundscape embeddings should closely reflect clusters of birdcalls. It
would be interesting to explore how well we can discriminate between recording sites, as location likely
correlates with species distribution and co-occurrence patterns.</p>
        <p>We expect soundscapes to better represent the species distribution in the hidden test set. However,
our results show that the models trained on the soundscapes perform worse than those trained on the
original dataset. Although the addition of soundscapes adds an interesting dimension to the competition,
it requires more than cursory experimentation to incorporate into modeling effectively.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Self-Supervised Neural Codecs</title>
        <p>
          We find that EnCodec does not transfer well with similar experiments involving the linear and two-layer
classifiers, achieving validation F1-scores below 0.1. Adding an LSTM layer to handle the sequential
nature of EnCodec embeddings did not improve the scores. A much deeper model, similar to the
EnCodec decoder [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], is likely needed to learn from the quantized embeddings, but this is not feasible
within the competition’s inference time constraints.
        </p>
        <p>Additionally, EnCodec is computationally expensive and difficult to adapt to the constrained
submission environment. The Python profiler identified model inference as the bottleneck, with most time
spent on EnCodec inference. OpenVINO post-training optimizations for quantizing and compressing
weights do not significantly improve inference throughput, likely due to existing optimizations in the
upstream library. A 1.5× speedup is needed to use EnCodec in our pipeline, indicating that further
optimizations are required to leverage neural codecs based on large datasets trained with attention and
self-supervision.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Future Work</title>
      <p>Exploiting co-occurrence species information as a prior to the learning process could be beneficial.
We have demonstrated frequent pattern mining to obtain co-occurrence distributions and quantify
differences from the training dataset. Confident relationships extracted from the data can be visualized,
as shown in Figure 6, and used to reshape the probability distribution of an existing classifier to better
represent the posterior of the unlabeled soundscape.</p>
      <p>
        We aim to explore alternative parameterizations of sequential models that are computationally viable
for future competitions. The competition’s trade-offs favor compact domain-specific models over
large neural networks, focusing on linearithmic algorithms like the Fast Fourier Transform for input
representation. Finding a pre-trained neural audio codec with fewer parameters that fits within our
computational budget and passes human perceptual tests could be viable. Alternatively, training models
from scratch using different architectures via distillation methods, compatible with the encoder-decoder
architecture used in EnCodec, could be explored. State-space models like Mamba [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] provide an
appealing alternative to attention-based methods, potentially staying within our computational budget.
      </p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>Our study demonstrates the effectiveness of transfer learning in birdcall classification using embeddings
from pre-trained models like Google’s Bird Vocalization Classification Model and BirdNET. These
embeddings capture meaningful structures that are beneficial for multi-label classification, although
they do not outperform many top models in the competition. Our best-performing model, which uses
BirdNET embeddings and Bird Vocalization pseudo-labels to train a linear classifier, achieved a 0.63
score on the post-competition public leaderboard. Future work will focus on optimizing computational
efficiency and exploring alternative model architectures to better handle the sequential nature of audio
data. We also plan to incorporate species co-occurrence patterns to further enhance classification
accuracy. Our code is available at github.com/dsgt-kaggle-clef/birdclef-2024.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>Thank you to the Data Science at Georgia Tech (DS@GT) club for providing hardware for experiments,
and to the organizers of BirdCLEF and LifeCLEF for hosting the competition.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Navine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of BirdCLEF 2022: Endangered bird species recognition in soundscape recordings</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1929</fpage>
          -
          <lpage>1939</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Durak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Arikan</surname>
          </string-name>
          ,
          <article-title>Short-time Fourier transform: two fundamental properties and an optimal implementation</article-title>
          ,
          <source>IEEE Transactions on Signal Processing</source>
          <volume>51</volume>
          (
          <year>2003</year>
          )
          <fpage>1231</fpage>
          -
          <lpage>1242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eibl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>BirdNET: A deep learning solution for avian diversity monitoring</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>61</volume>
          (
          <year>2021</year>
          )
          <fpage>101236</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wisdom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>Improving bird classification with unsupervised sound separation</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>636</fpage>
          -
          <lpage>640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Global birdsong embeddings enable superior transfer learning for bioacoustic classification</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>22876</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shaposhnik</surname>
          </string-name>
          ,
          <article-title>Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>73</lpage>
          . URL: http://jmlr.org/papers/v22/20-1061.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Défossez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Copet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Adi</surname>
          </string-name>
          ,
          <article-title>High fidelity neural audio compression</article-title>
          (
          <year>2022</year>
          ). arXiv:2210.13438.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Demkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <source>BirdCLEF 2024</source>
          ,
          <year>2024</year>
          . URL: https://kaggle.com/competitions/birdclef-2024.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Mining frequent patterns without candidate generation</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          <volume>29</volume>
          (
          <year>2000</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ridnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ben-Baruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Protter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zelnik-Manor</surname>
          </string-name>
          ,
          <article-title>Asymmetric loss for multi-label classification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bénédict</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <article-title>SigmoidF1: A smooth F1 score surrogate loss for multilabel classification</article-title>
          ,
          <source>arXiv preprint arXiv:2108.10566</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gorbachev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fedorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Slavutin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tugarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fatekhov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tarkan</surname>
          </string-name>
          ,
          <article-title>OpenVINO Deep Learning Workbench: Comprehensive analysis and tuning of neural networks inference</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <article-title>Mamba: Linear-time sequence modeling with selective state spaces</article-title>
          ,
          <source>arXiv preprint arXiv:2312.00752</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>