<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Addressing the Challenges of Domain Shift in Bird Call Classification for BirdCLEF 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emiel Witting</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo de Heer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jefrey Lim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cahit Tolga Kopar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kristóf Sándor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dream Hall</institution>
          ,
          <addr-line>Stevinweg 4, 2628 CN Delft</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Delft Dream Team Epoch</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents Team Epoch IV's solution to the BirdCLEF 2024 competition, which focuses on developing machine learning models for bird call recognition. The primary challenge in this competition is the significant domain shift between the Xeno-Canto recordings used for training and the passive acoustic monitoring (PAM) soundscapes used for testing. This shift poses difficulties due to differences in recording equipment, recording conditions, and background noise, which complicate accurate species identification. We delve into the specifics of this domain shift, quantifying its impact on model performance, and propose methods to mitigate its effects. Our approach includes a comprehensive set of data augmentations and pre- and postprocessing techniques to enhance model robustness and generalization. We performed extensive experiments to verify the effectiveness of these methods. Our findings provide a foundation for future work in addressing domain shift challenges in bioacoustic monitoring, contributing to more accurate and reliable biodiversity assessments.</p>
      </abstract>
      <kwd-group>
        <kwd>Bird Species Classification</kwd>
        <kwd>Domain Shift</kwd>
        <kwd>Domain Adaptation</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Passive Acoustic Monitoring</kwd>
        <kwd>Kaggle Competition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        BirdCLEF 2024 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a Kaggle competition aimed at advancing machine-learning solutions for bird
call recognition, as part of LifeCLEF [
        <xref ref-type="bibr" rid="ref2 ref9">2</xref>
        ]. The primary task involves developing data processing
techniques and models to identify bird species from continuous audio recordings, specifically targeting
under-studied Indian bird species in the Western Ghats. This competition holds value for biodiversity
monitoring, as it leverages PAM to facilitate extensive and temporally detailed surveys, contributing to
conservation efforts.
      </p>
      <p>Participants face several notable challenges, primarily centred around the domain shift between the
training data and test soundscapes. One of the main hurdles is the difference between the Xeno-Canto
recordings used for training and the PAM soundscapes used for testing. This shift is exacerbated by the
fact that Xeno-Canto recordings are not expert-labelled and do not provide labels for each five-second
segment, but rather for the entire file. This lack of precise labelling makes it challenging to handle
secondary labels accurately. The absence of PAM data in the training set poses a significant obstacle:
participants must develop models without having access to the same type of labelled data on which
their models will be evaluated, which necessitates innovative approaches to generalize effectively.
Additionally, the competition imposes a strict inference time limit of two hours on a CPU, requiring
efficient algorithmic implementations.</p>
      <p>
        This paper presents Team Epoch IV’s solution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to the BirdCLEF 2024 competition, with a primary
focus on analyzing and addressing the domain shift challenge. We delve into the specifics of this shift
and examine its impact on the discrepancy between local cross-validation scores and the public and
private leaderboard scores. Our approach includes a detailed exploration of methods to mitigate these
differences and enhance model performance across varied data domains.
      </p>
      <p>The paper is structured as follows: Section 2 describes our implementation strategy, including
environmental setup, data preprocessing, data augmentation, model selection, and postprocessing
techniques. Section 3 discusses the domain shift between training and test data. Section 4 presents our
experiments and results, including an ablation study and seed stability analysis. Section 5 discusses our
findings, and Section 6 concludes with future work and acknowledgements.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Implementation</title>
      <p>In this section, we detail our implementation strategy employed for our participation in the BirdCLEF
2024 competition. Our approach encompasses environmental and training setup, data preprocessing,
data augmentation, model selection, and postprocessing techniques.</p>
      <sec id="sec-2-1">
        <title>2.1. Environmental setup</title>
        <p>
          During the competition, we collaborated as a team. Instead of working in notebooks, which does not
allow for streamlined collaboration, we developed and used our machine learning framework Epochalyst
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This package contains many modules and classes extracted from previous Epoch [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] competition
experience to start new competitions quickly. Epochalyst uses Hydra to load configuration
.yaml files that specify full training or ensemble runs and instantiates elements directly into Python
objects for efficient development. We used Rye [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for project &amp; package management and designed a
custom lazy loading multiprocessing pipeline for loading audio using Dask [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Librosa [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. PyTorch
[9] was utilized as the main framework for training, with additional libraries such as Timm [10] for
using various 2D Convolutional Neural Network architectures. Additionally, for an extra ~2× inference
speed up, ONNX [11] and OpenVINO [12] were used to maximise performance. Models were trained
on on-site hardware running Linux [13], specifically on PCs running AMD Ryzen 9 7950X 16-Core
Processor (96GB RAM) with an NVIDIA RTX A5000 GPU using Python 3.10.13. Model training and run
artefacts were logged on Weights &amp; Biases [14] to keep a clear overview of all of our experiments.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Preprocessing</title>
        <p>The BirdCLEF 2024 training dataset consists of 24459 audio .ogg files uploaded by users of Xeno-Canto
[15], covering 182 different bird species. All training audio was resampled to 32 kHz to
match the test soundscape sampling rate. Pretraining on additional data
from previous BirdCLEF competitions did not improve results, so we used only this year's data for our final submission.
For training, we used a 5-fold CV with a stratified split based on the primary label of the audio file. This
ensures that the species are equally represented in each fold. Taking the first 5 seconds of each audio file
proved optimal, since bird calls tend to appear early
in the uploaded recordings. Some Xeno-Canto files also contained secondary labels for bird species that
appeared in addition to the primary bird. For these, we set the secondary labels to 0.5 and the primary
labels to 1, because the primary birds were consistently more audible in the audio files compared to the
secondary birds.</p>
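<p>The soft-label encoding described above can be sketched as follows. This is a minimal illustration; the helper name and toy species list are ours, not from the competition code.</p>

```python
import numpy as np

def encode_labels(primary, secondary, species):
    """Build a soft multi-label target: the primary species gets 1.0,
    the less audible secondary species get 0.5, all others 0.0."""
    target = np.zeros(len(species), dtype=np.float32)
    target[species.index(primary)] = 1.0
    for s in secondary:
        target[species.index(s)] = 0.5
    return target

# Toy subset of the 182 classes.
species = ["browowl1", "comior1", "comkin1", "woosan"]
y = encode_labels("comior1", ["woosan"], species)
```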
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data Augmentation</title>
        <p>We have implemented several data augmentation techniques to increase the robustness of our models and
to address the domain shift between the training data and the test soundscapes. Our full augmentation
pipeline is listed below. Some of the augmentations are 1D, meaning they are applied to the
raw audio signal. Afterwards, we converted the signal to Mel spectrograms of 256 × 256 pixels, with a
frequency range of 1 Hz to 16 kHz, which are then normalized so that all values lie in the range
of 0 to 1. As a last step in our custom dataset, some 2D augmentations are applied.
• PhaseShift: randomly shifting the phase of each frequency component of the signal with p = 0.5 and
shift_limit = 0.5.¹ ²
• AmplitudeShift: randomly shifting the amplitude of each frequency component of the signal with p = 0.5.
• MixUp [16] with p = 0.5:</p>
        <p>Linearly interpolating both features and labels of two samples, with random weights.
• CutMix [17] with p = 0.5:</p>
        <p>Randomly cropping and replacing part of a sample with another sample. The labels are averaged
linearly with weights proportional to the length/area of each sample.</p>
        <p>• CutMix (2D) with p = 0.5, applied on the Mel spectrogram.</p>
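<p>The two 1D mixing augmentations can be sketched as below. This is a simplified illustration with explicit mixing parameters; the actual pipeline draws them at random and applies each augmentation with probability p = 0.5.</p>

```python
import numpy as np

def mixup_1d(x1, y1, x2, y2, w):
    """Linearly interpolate both the raw audio and the label vectors."""
    return w * x1 + (1 - w) * x2, w * y1 + (1 - w) * y2

def cutmix_1d(x1, y1, x2, y2, start, end):
    """Replace x1[start:end] with the same span from x2; labels are averaged
    with weights proportional to each sample's share of the length."""
    x = x1.copy()
    x[start:end] = x2[start:end]
    frac = (end - start) / len(x1)
    return x, (1 - frac) * y1 + frac * y2

x1, x2 = np.ones(8), np.zeros(8)          # two toy audio clips
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
xm, ym = mixup_1d(x1, y1, x2, y2, w=0.75)
xc, yc = cutmix_1d(x1, y1, x2, y2, start=0, end=2)
```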
        <p>
          Figure 1 visualizes our augmentation pipeline. The phase shift aims to simulate background
noise in bird regions, in an effort to reduce the shift between the clear training examples and the noisy
soundscapes. The amplitude shift amplifies different frequencies in the signal domain to enhance robustness
against bird-volume variance, since some birds in the soundscapes are in close proximity to
the recording location while others are located further away. Afterwards, CutMix1D and
MixUp1D are applied in the signal domain to improve learning when there are multiple birds in the
same audio file, a common occurrence in the soundscapes. Finally, CutMix2D is applied
after converting the output of the previous pipeline to a Mel spectrogram. An ablation study of these augmentations
can be found in Section 4.1.
¹ This does not influence the magnitude spectrum taken over the whole recording, but when windowed magnitude spectra are extracted it has the effect seen in Figure 1.
² A shift_limit in the range [0, 1] corresponds to a phase shift of [0, 2π].
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Models</title>
        <p>We mainly used Timm [10] for straightforward model development where we experimented with
various architectures. Some of the best encoders that we have found were: convnext_tiny and
eca_nfnet_l0. The convnext_tiny model got the highest public leaderboard score of 0.701 while
we observed it being more unstable over multiple submitted training runs. eca_nfnet_l0, on the
other hand, had a slightly lower public score of 0.688 but we found it to be a more stable model during
experimentation. We decided to submit these two models: the more stable one and the less stable
one but with a higher public score. Models were trained for 50 epochs with an initial learning rate
of 1e-4, using Binary Cross-Entropy [18] loss and the AdamW [19] optimizer. A single-cycle
CosineAnnealing learning rate scheduler was employed with a slight warmup of 2 epochs to ensure
initial stability. Furthermore, models used a sigmoid activation function so that outputs ranged
between 0 and 1. Local evaluation was done on every 5 seconds of each file using the AUROC [20]
metric, where we observed a significant shift between our local scores and public scores. We were able
to optimize our local scores to ~0.995 AUROC by adding dropout, training on multiple datasets from
previous BirdCLEF competitions,³ and including additional augmentations. However, any optimization
above ~0.98 locally caused the public score to drop significantly. This indicates an overfitting
pattern on Xeno-Canto data that reduces performance on the soundscapes, prompting us to focus
on minimizing the shift rather than optimizing on training data.</p>
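<p>The learning-rate schedule described above (2 warmup epochs followed by a single cosine cycle over the remaining epochs) can be written out as follows. This is our reading of the setup, not the exact training code.</p>

```python
import math

BASE_LR, WARMUP, TOTAL = 1e-4, 2, 50

def lr_at(epoch):
    """Linear warmup over the first 2 epochs, then one cosine annealing
    cycle from the base learning rate down towards zero at epoch 50."""
    if epoch < WARMUP:
        return BASE_LR * (epoch + 1) / WARMUP
    progress = (epoch - WARMUP) / (TOTAL - WARMUP)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * progress))
```

In PyTorch this corresponds to chaining a short warmup into <code>torch.optim.lr_scheduler.CosineAnnealingLR</code> on an AdamW optimizer.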
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Postprocessing</title>
        <p>The test soundscapes are 4 minutes long, and we must predict for each 5-second window, resulting
in 48 predictions per soundscape. We calculate the mean bird-species probability per soundscape over
the 48 windows and multiply each individual prediction by the mean of the soundscape it is in. The
reasoning is that birds usually appear multiple times per recording, so the
mean should be high for birds that are truly in the audio. This consistently improved our scores by
~0.02 on both the public and private leaderboards.</p>
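<p>This mean-multiplication postprocessing amounts to one line of NumPy (shapes and the function name are illustrative):</p>

```python
import numpy as np

def smooth_soundscape(preds):
    """preds: (n_windows, n_species) sigmoid outputs for one soundscape.
    Multiply each window's prediction by the per-species mean over the
    soundscape, boosting species that recur across many windows."""
    return preds * preds.mean(axis=0, keepdims=True)

# 4 windows shown for brevity instead of the full 48.
preds = np.array([[0.9, 0.1],
                  [0.9, 0.1],
                  [0.1, 0.1],
                  [0.1, 0.1]])
out = smooth_soundscape(preds)
```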
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Domain shift</title>
      <p>The training data for this competition was sourced from a different domain than the test set on which
the models are intended to run. This problem is highly relevant in non-competition settings,
where (labelled) test data is not always available. The impact quickly becomes obvious when observing
that models can achieve above 0.99 AUROC on held-out training data, but score below 0.70 AUROC on
the test set. In this section, we hypothesize about the components of this discrepancy, guided by statistical
analysis, knowledge about the data source, and visual inspection. Furthermore, we attempt to quantify
the domain shift and measure the impact of techniques to mitigate it.</p>
      <sec id="sec-3-1">
        <title>3.1. Mapping the datasets</title>
        <p>We explored the train dataset and looked for differences with the unlabeled soundscapes by making a
visual overview. To ensure that we organize the audio in the way our model perceives it, we compared
the activations of the last hidden layer of a baseline model, instead of the raw input. For both domains,
the first five seconds of one thousand unique recordings were fed through the model. The activations
were then projected onto ℝ² using UMAP [21]. This shows that there is partial overlap between the
domains, and part of the test domain lies completely outside of the training distribution (Figure 2a).</p>
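<p>A minimal version of this embedding step, using a plain PCA projection as a lightweight stand-in for UMAP (the activations here are random placeholders, not model outputs):</p>

```python
import numpy as np

def project_2d(activations):
    """Project last-hidden-layer activations onto their top-2 principal
    components; the paper uses UMAP, which preserves more local structure."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 128))  # 1000 recordings x 128 hidden units
emb = project_2d(acts)
```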
        <p>We then plotted the corresponding spectrograms at the positions of their UMAP embeddings. This
allowed us to understand and identify the different regions of the dataset manually, as shown in Figure
3. Quadrants I and II contain mostly no-calls. It makes sense for these to be outside of the training
distribution, which should contain only labelled bird calls. The difference appears to be that II is fully
quiet, or at least has uniform noise, whereas I has noisy recordings with sounds other than birds.
Most bird calls appear in III and IV. III contains mostly training data, which seems to be characterized
by low-noise, high-contrast bird call recordings. Towards region IV there is a gradient of increasing
amounts of test data. This region is characterized by high-background-noise images with less contrast
(the background looks consistently brighter) and horizontal stripes. We assume the horizontal stripes
are likely insects, such as cicadas. Note that there are in fact some training samples that fit into this
distribution, as can be seen in both Figure 2a and in the bottom right of Figure 3.
(Figure 2: (a) distribution shift; (b) top-20 known labels.)
³ BirdCLEF 2020, 2021, 2022 and 2023 Xeno-Canto and labelled soundscape data retrieved from Zenodo.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Shift causes</title>
        <p>It is important to be cautious about assuming the nature of the problem. High train performance and
low test performance on another domain seem to warrant (unsupervised) domain adaptation, to solve
the apparent problem of domain shift. Well-documented forms include covariate shift, prior shift, or
concept shift [22]. These might be approached with feature-based sample weighting, class-based sample
weighting, and deep domain adaptation methods respectively. However, all of these forms rely on the
assumption that there is only one type of shift at a time and that some other factor remains constant.
Furthermore, it is possible that generalisation is not an issue, and that the drop in score can be explained
solely by the fact that the test domain is just uniformly more difficult and ambiguous.</p>
        <p>With this caution in mind, we hypothesized three main contributors for the drop in score:
1. The models underperform on no-call audio, which the Xeno-Canto training data does not contain.
2. The PAM test data is inherently more difficult and ambiguous to classify, even for models trained
on it.
3. The PAM test data is shifted into feature distributions that our model has not encountered or
generalized to properly during training.</p>
        <p>We exclude prior shift, or label imbalance, as a root cause, because the scoring metric is mostly
class-balance invariant.</p>
        <p>Hypothesis 1 was confirmed by measuring the predictions on test samples from regions I and II
that we confirmed to be no-calls. We observed that our model consistently made predictions
at around 0.5–0.6 confidence. These false positives occur across a handful of species, mostly
browowl1, comior1 and comkin1 for region I, and woosan for region II. We estimate this is partly due
to a random bias and partly to correlated background noise. We have seen false browowl1 positives across
several models, possibly because those samples were recorded for a nocturnal species with mostly quiet
recordings and occasional insects.</p>
        <p>Hypothesis 2 allows the possibility that test-like data is in fact represented in the train data, but at
lower proportions. We might approximate the difficulty of the PAM-like samples by measuring train
scores in region IV. This resulted in a class-mean AUROC of 0.985, which is clearly significantly higher
than the leaderboard test score, even accounting for the small sample size (154 samples with 75 unique
species). This evidence contradicts hypothesis 2. A reason for not rejecting it fully is that those train
samples might not be representative of test data, and that they differ along a dimension that is not
captured by UMAP.</p>
        <p>Hypothesis 3 is the standard problem of models not generalizing to data outside of the training
distribution. The main differences we noticed visually were the decreased contrast (low signal-to-noise
ratio) and horizontal stripes, possibly from insects. Furthermore, we observed more overlapping bird
calls in the test soundscapes than in the train audio. A participant in BirdCLEF 2023 mentioned reverb
[23]. This might also play a role, although we have not had the opportunity to verify this or test reverb
augmentations.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Shift mitigation</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Call or no call classification</title>
          <p>To mitigate the fact that our model underperforms on no-call audio, a two-stage pipeline was introduced.
The first stage consists of a model trained on the Freefield1010 [24] dataset to perform binary
classification for every 5-second window, predicting whether it contains a bird call; this stage achieved an
F1-score of 0.810. If the first stage predicted no call in a 5-second
window of a soundscape, all predictions were set to 0 and the second stage for this window was skipped,
also saving valuable inference time. After empirical visual inspection of the soundscapes, the two-stage
predictions appear to be correct for silent soundscapes, with Figures 9 and 10 illustrating predictions of
our best submission compared to our two-stage approach on a silent soundscape. Interestingly, against
our expectations, our public scores did not improve when submitting our two-stage approach. Further
investigation with the labels of the test soundscapes is recommended to detect where our two-stage
model is making its mistakes.</p>
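<p>The two-stage gating logic reduces to a short conditional; the threshold, function names, and stand-in detectors below are ours, added for illustration.</p>

```python
import numpy as np

def two_stage_predict(window_audio, call_detector, species_model, threshold=0.5):
    """Stage 1: binary call/no-call detector. If no call is predicted,
    emit all-zero species probabilities and skip the (expensive) stage 2."""
    if call_detector(window_audio) < threshold:
        return np.zeros(182)  # one probability per species
    return species_model(window_audio)

silent = np.zeros(160000)  # 5 s of silence at 32 kHz
preds = two_stage_predict(silent,
                          call_detector=lambda a: 0.1,          # stand-in stage 1
                          species_model=lambda a: np.full(182, 0.5))  # stand-in stage 2
```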
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Test Audio Scaling</title>
          <p>In order to remove the shift between distributions described in hypothesis three, we tried two
techniques. The first is to scale down the test audio during inference. Because we min-max scaled our
spectrograms, this is analogous to increasing ε in the logarithmic scaling log(x + ε) that we applied
to spectrograms. We scale the audio down by a factor of 1/100, which we found through empirical
experimentation. The effect is increased contrast, which visually makes the test data look more similar
to the train data. We consistently achieved higher scores on both the public and private leaderboards
as a result.</p>
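<p>A small numerical sketch of why scaling the audio down acts like increasing ε: after log scaling and min-max normalization, faint background energy is pushed towards zero relative to loud calls. The power values here are synthetic, chosen only to show the effect.</p>

```python
import numpy as np

def log_minmax(power, eps=1e-6):
    """Log-scale a power spectrogram and min-max normalize it to [0, 1]."""
    logged = np.log(power + eps)
    return (logged - logged.min()) / (logged.max() - logged.min())

power = np.array([1e-4, 1e-2, 1.0])        # noise floor, faint call, loud call
plain = log_minmax(power)                   # factor 1
scaled = log_minmax(power * (1 / 100) ** 2) # audio x 1/100 -> power x 1/10000
```

After scaling, the faint background energy sits much closer to 0 while the loudest call still maps to 1, i.e. the contrast increases.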
          <p>(Figure: example spectrograms at scaling factors (a) 1, (b) 1/100, and (c) 1/500.)</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Frequency-based noise removal</title>
          <p>The second technique aims to remove ambient noise. It treats the audio as a sum of infrequent (bird)
noises, and background noise that is stronger in some frequencies than others, but is constant over time.
Visually, this means removing all horizontal stripes from spectrograms.</p>
          <p>To obtain a robust estimate of the background noise level per frequency, the quantile q = 0.25 was
used per row of the spectrogram. This implicitly assumes that a bird call does not occupy the same
frequency for more than three-quarters of the sample. If that assumption holds, the value will not be
impacted by outliers from bird calls, to which the mean would be sensitive. This estimate is then subtracted from
the original image; an example is shown in Figure 5.</p>
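<p>The per-frequency noise subtraction can be written directly in NumPy; this is a sketch of the technique as described, not the competition code.</p>

```python
import numpy as np

def remove_stationary_noise(spec, q=0.25):
    """Estimate the per-frequency (per-row) background level with the 0.25
    quantile over time, then subtract it, clipping at zero. Constant
    horizontal stripes vanish; brief bird calls barely move the quantile."""
    noise = np.quantile(spec, q, axis=1, keepdims=True)
    return np.clip(spec - noise, 0.0, None)

spec = np.array([[0.2, 0.2, 0.2, 0.2],   # insect drone: constant stripe
                 [0.0, 0.9, 0.0, 0.0]])  # brief bird call
clean = remove_stationary_noise(spec)
```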
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Domain distance</title>
          <p>To quantify the extent to which the two domains were becoming more similar, we used a modified Fréchet
Inception Distance [25]. FID compares the distributions of last-hidden-layer activations of an
Inception-v3 network between two datasets; we use the activations of our best submission
model instead. Because the goal was not to remove the discrepancy between call and no-call, that
separation needed to be preserved, so only regions III and IV were used. We hypothesised that this could be
used to estimate the impact of a shift mitigation technique before training a model and evaluating it
with test labels.</p>
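<p>The Fréchet distance between two activation sets compares their Gaussian statistics. A NumPy-only sketch (using the symmetric-PSD square-root identity, so that scipy.linalg.sqrtm is not needed; the activation arrays are random placeholders):</p>

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

def frechet_distance(acts1, acts2):
    """FID-style distance between two activation sets of shape (n, d)."""
    mu1, mu2 = acts1.mean(axis=0), acts2.mean(axis=0)
    c1 = np.cov(acts1, rowvar=False)
    c2 = np.cov(acts2, rowvar=False)
    s1 = _sqrtm_psd(c1)
    # trace of sqrt(s1 @ c2 @ s1) equals trace of sqrt(c1 @ c2)
    covmean = _sqrtm_psd(s1 @ c2 @ s1)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 8))  # placeholder "activations"
```

Identical activation sets give a distance of (numerically) zero; a mean shift of 1 in every dimension adds its squared norm to the distance.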
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Deep Domain Adaptation</title>
          <p>Some methods are developed specifically to mitigate domain shift while training a deep learning
model; examples are DANN [26] and MDD [27]. They take labelled train data and unlabeled test
samples. We did not have success with these techniques, however. In part, this was due to the difficulty
of tuning hyperparameters for the adversarial networks they contain, which are prone to instability.
An argument for why these techniques might not be suitable without modification is that their
objective includes removing as many differences between train and test as possible. This could cause
issues when a major difference is the existence of no-calls only in the test data. Optimising
the objective might then require removing the ability to tell bird calls from silence.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section contains our ablation study in Section 4.1 and seed stability
experiments in Section 4.2. To verify the significance of improvements in the ablation study, we also
investigated the effect of randomness on leaderboard performance. For this, we trained and evaluated
experiments with the same configuration but with various seeds that affect the data split, augmentations,
and model weight initialization.</p>
      <sec id="sec-4-1">
        <title>4.1. Ablation Study</title>
        <p>In our ablation study, we aim to analyse the effect of our augmentations. We performed 6 training
runs of our best submission model with 5-fold CV, adding one augmentation at a time. We
submitted each fold individually, resulting in 30 total submissions. From Figure 6 we observe a very
high score variance within folds and a positive correlation of 0.85 between public and private scores. The
Leaderboard group in the boxplot contains a weighted average of the public and private scores,
calculated as 35% Public Score + 65% Private Score. MixUp1D resulted in the most significant
improvement of ~0.03 on the Kaggle leaderboard.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Seed Stability</title>
        <p>For our seed stability experiment, we used the model pipeline from our best submission and retrained
that model 5 times with different seeds for 5-fold CV each. We made late submissions on Kaggle for
each fold, resulting in a total of 25 variations of the same model. The public and private leaderboard
scores of these models are shown in Figures 7a and 7b.</p>
        <p>(Figure 7: (a) distribution of public &amp; private LB scores; (b) correlation between public &amp; private LB scores: 0.25.)</p>
        <p>There is a significant variance for both the public and private leaderboard scores with the same
model. More specifically, the public leaderboard scores have a standard deviation of 0.01197 and the
private leaderboard scores have a standard deviation of 0.01091. Furthermore, we can observe a slight
correlation of 0.25 between public and private leaderboard scores by only varying the seed of the run
configuration.</p>
        <p>It is worth noting that while various ensembling techniques can be used to stabilize models, this is
not always possible due to the strict CPU inference time limit.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Shift mitigation</title>
        <p>To evaluate the effect of shift mitigation using test-time scaling and the frequency-based filter, 5
submissions were made for each, along with a baseline. For the frequency-based filter, the model was
trained again with the filter applied, while the test scaling was performed at inference time with the
original model weights. Again, these 5 submissions are the results of training on different folds. The
results are shown in Figure 8; the score is calculated as 35% Public Score + 65% Private Score. Both methods
improve on the baseline, audio scaling most significantly.</p>
        <p>For these three techniques, we measured the distance between train and test distributions as described
in Section 3.3.4. This was measured only for data that was in regions III and IV in the original UMAP
projection, with the same model that generated Figure 2. For the baseline, the modified FID was 41.4;
applying the frequency-based filter to all data shrank the distance to 37.6, while rescaling
only the test audio by 1/100 increased it to 46.7.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>Augmentations</title>
        <p>From our ablation study in Section 4.1 we found that MixUp1D was one of the best-performing
augmentations. We suppose this is because the soundscapes contain more simultaneous birds than
our training data, and MixUp increases the models' capability to learn different birds at the same
time. The CutMix augmentations did not result in a significant improvement; after further analysis
we observed that CutMix often cuts out a bird call and replaces it with a silent section of another bird
audio fragment, thereby teaching the model to annotate silence with a bird call.
We suspect the PhaseShift augmentation was set too extreme, adding too much
noise and reducing the models' capacity to learn the visual bird call patterns. The
AmplitudeShift results, by contrast, were inconclusive, and we suggest tuning it to a higher intensity for
more effect.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Seed stability</title>
        <p>An interesting observation was that our models were in general quite unstable. During
experimentation we did not find a reliable way of evaluating our models locally, so we relied on
the public leaderboard. In the experiments in Section 4.2, we observed that the same model
configuration trained with a different seed could significantly change public and private leaderboard
performance, indicating that randomness was heavily involved during our experimentation
phase. During this phase, we implemented novel ideas, and the only way we found to evaluate an idea
was by making a submission. We might therefore have discarded ideas that got an 'unlucky' low public
score due to this randomness, which would have fared better had we analysed the average over multiple
submissions. Furthermore, it is interesting to note that the public and private scores are slightly
correlated when submitting different seeds, which could indicate that optimizing the seed on the public
leaderboard also transfers to a higher private score.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Shift mitigation</title>
        <p>Visualizing and interacting with the datasets helped us understand the differences between train and
test. This guided us towards two techniques that visually removed the shift and also improved scores
on the test set. However, we are not certain about the impact of the no-call segments. While there seemed
to be many false positives, the two-stage approach did not solve this problem. This might be
caused by not optimizing the two-stage model sufficiently, but it is also surprising that of the top 5
solutions for BirdCLEF 2024, none seemed to get consistent improvements from a call/no-call
model, even though it has been used successfully in other editions. Furthermore, it is not clear what
proportion of the score drop is caused by the shift in regions III and IV (Figure 3) versus false positives.
We are not able to confirm this without access to the labels.</p>
        <p>Unfortunately, decreasing the FID distance between domains does not guarantee an improved test
score, or vice versa. While the distance increased for scaling and decreased for the filter, the
score improved for both. The distance change might be explained by the fact that scaling was only
applied to the test data, which introduces a synthetic shift that can be measured but does not negatively
impact the model. The frequency-based noise filter, applied to both datasets instead of only one,
removed shift distance as intended.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented our solution for the BirdCLEF 2024 competition, focusing on the challenge
of domain shift between the training and test datasets. Our approach includes shift mitigation through
data augmentation and preprocessing. We evaluated the stochasticity of the results and performed
experiments with thorough 5-fold validation.</p>
<p>We found that:
• Domain Shift Mitigation: Applying frequency-based noise removal and scaling test samples, as
guided by exploratory data analysis, proved successful. This introduces a novel filter that could
be applicable to other PAM audio classification problems and generally highlights the importance
of investigating the aspects of domain shift.
• Data Augmentation: MixUp1D applied to audio is a particularly effective technique, likely due
to the presence of multiple bird calls per recording in the test soundscapes. Other
augmentations such as CutMix and PhaseShift did not yield improvements; these might need to
be adapted or excluded in similar experiments.
• Seed Stability: Our seed stability experiments revealed substantial variance in public and private
leaderboard scores, emphasizing the impact of randomness in model training. This underscores
the importance of averaging results over multiple seeds to obtain a reliable performance estimate.
It also implies that conclusions based on the competition outcome should be drawn with caution
if scores are close.
• Call/No-Call Classification: Implementing a two-stage pipeline for call/no-call classification did
not result in the expected score improvements. This suggests that our current implementation
may need refinement or that the issue of false positives is more complex than anticipated.</p>
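<p>A minimal sketch of MixUp applied directly to raw waveforms, in the spirit of the MixUp1D augmentation listed above; the array shapes, label layout, and alpha value are illustrative assumptions, not our exact configuration.</p>

```python
import numpy as np

def mixup_1d(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two raw-audio samples and their label vectors.

    x1, x2: 1-D waveform arrays of equal length.
    y1, y2: multi-hot label vectors (one entry per species).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)      # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2   # overlap the two waveforms
    y = lam * y1 + (1.0 - lam) * y2   # soft labels reflect both calls
    return x, y

rng = np.random.default_rng(42)
a, b = rng.normal(size=32000), rng.normal(size=32000)  # two short clips
ya = np.array([1.0, 0.0, 0.0])
yb = np.array([0.0, 1.0, 0.0])
x, y = mixup_1d(a, ya, b, yb, rng=rng)
```

Mixing at the waveform level produces training samples with overlapping calls and soft multi-species labels, which plausibly matches the test soundscapes better than single-call Xeno-Canto clips.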
      <p>Overall, our findings indicate that addressing domain shift is crucial for achieving robust performance
in bird call classification tasks. Our methods provide a foundation for future work, including further
refinement of data augmentation techniques, deeper analysis of domain shift, and more sophisticated
model evaluation strategies.</p>
<p>In conclusion, while we achieved notable improvements, the BirdCLEF 2024 competition highlighted
the ongoing challenges in developing models that generalize well across different acoustic environments.
Our results underscore the need for continuous innovation and experimentation in tackling domain
shift and enhancing model robustness.</p>
      <sec id="sec-6-1">
        <title>6.1. Future work</title>
<p>Looking ahead, several avenues for future research and experimentation emerge from our findings.
First of all, during our experimentation phase, we might have discarded ideas because they were based
on a single submission. Given the observed impact of randomness on the scores, several ideas that
were initially discarded are worth investigating again, including:
• Alternating Ensemble: For every 4-minute soundscape, corresponding to 48 windows, let
different models predict the windows alternately, followed by averaging neighbouring window
predictions. In this way, we can ensemble without increasing inference time.
• Pretraining on Previous Years: Rerun experiments where we append data from previous BirdCLEF
editions, including soundscape PAM data from Zenodo.
• Two-Stage Refinement: Refine the two-stage pipeline and incorporate more sophisticated
methods for distinguishing between bird calls and background noise, in addition to the
freefield1010 dataset.
• Longer Window Models: Experimenting with models that process longer audio windows (e.g., 10
seconds) could provide more context and improve classification accuracy.
• Multi-Channel Spectrograms: Investigating the use of multi-channel spectrograms to capture
richer audio information.</p>
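<p>The alternating-ensemble idea above can be sketched as follows. The two models are stubbed out with random predictors, and the class count is an illustrative assumption; only the alternation and neighbour-averaging pattern is the point.</p>

```python
import numpy as np

N_WINDOWS, N_SPECIES = 48, 182  # 48 windows per 4-minute soundscape;
                                # 182 classes assumed for illustration

# Stand-ins for two trained models; each maps window indices to scores.
rng = np.random.default_rng(0)
model_a = lambda idx: rng.random((len(idx), N_SPECIES))
model_b = lambda idx: rng.random((len(idx), N_SPECIES))

# Each model predicts only every other window, so total inference cost
# equals a single model run over all 48 windows.
preds = np.empty((N_WINDOWS, N_SPECIES))
preds[0::2] = model_a(np.arange(0, N_WINDOWS, 2))
preds[1::2] = model_b(np.arange(1, N_WINDOWS, 2))

# Average each window with its neighbours so that every final
# prediction blends the outputs of both models.
padded = np.pad(preds, ((1, 1), (0, 0)), mode="edge")
smoothed = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
```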
<p>Secondly, obtaining and analyzing the labels of the test soundscapes would allow us to validate our
hypotheses about domain shift and the effectiveness of our mitigation techniques. Finally, we encourage
the competition hosts to re-analyze the best solutions to the competition. It could be very insightful
to measure the score when excluding no-calls, or excluding overlapping bird calls, to isolate these effects.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
<p>We would like to thank the organizers of the BirdCLEF 2024 competition and all involved institutions.
We extend our thanks to all the participants of the BirdCLEF 2024 competition who were active in the
Kaggle discussion forums for their ongoing efforts in advancing the field of bioacoustics and biodiversity
monitoring. Your dedication and collaboration are instrumental in driving forward conservation efforts
worldwide.</p>
      <p>[9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in:
Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.
URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[10] R. Wightman, PyTorch image models, https://github.com/rwightman/pytorch-image-models, 2019.
doi:10.5281/zenodo.4414861.
[11] ONNX Runtime developers, ONNX Runtime, https://onnxruntime.ai/, 2021. Version: 1.17.3.
[12] Intel Corporation, OpenVINO, https://docs.openvino.ai/, 2018. Open-source toolkit for optimizing and
deploying deep learning models.
[13] Canonical Ltd., Ubuntu 23.10 (Mantic Minotaur), https://releases.ubuntu.com/23.10/, 2023.
Operating system release.
[14] L. Biewald, Experiment tracking with Weights and Biases, https://www.wandb.com/, 2020. Software
available from wandb.com.
[15] Xeno-canto Foundation, Xeno-canto, https://xeno-canto.org/, 2005.
[16] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in:
International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=r1Ddp1-Rb.
[17] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, CutMix: Regularization strategy to train strong
classifiers with localizable features, CoRR abs/1905.04899 (2019). URL: http://arxiv.org/abs/1905.04899.
arXiv:1905.04899.
[18] P. Shukla, How did binary cross-entropy loss come into existence?, Towards AI (2023).
[19] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
(2017).
[20] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861–874.
[21] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for
dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
[22] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, F. Herrera, A unifying view
on dataset shift in classification, Pattern Recognition 45 (2012) 521–530. URL: https://www.sciencedirect.com/science/article/pii/S0031320311002901.
doi:10.1016/j.patcog.2011.06.019.
[23] M. Lasseck, Bird species recognition using convolutional neural networks with attention on
frequency bands, CEUR Workshop Proceedings (2023). URL: https://www.CEUR-WS.org/vol-3497/paper-175.pdf.
[24] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, M. D. Plumbley, An open dataset for research
on audio field recording archives: freefield1010, arXiv:1309.5275, 2013. doi:10.48550/arXiv.1309.5275.
[25] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale
update rule converge to a local Nash equilibrium, 2018. arXiv:1706.08500.
[26] A. Sicilia, X. Zhao, S. J. Hwang, Domain adversarial neural networks for domain generalization:
When it works and how to improve, 2022. arXiv:2102.03924.
[27] J. Li, E. Chen, Z. Ding, L. Zhu, K. Lu, H. T. Shen, Maximum density divergence for domain
adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2021) 3918–3930.
URL: http://dx.doi.org/10.1109/TPAMI.2020.2991050. doi:10.1109/tpami.2020.2991050.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Extra visualizations</title>
    </sec>
    <sec id="sec-9">
      <title>B. Results</title>
      <p>This section contains additional raw results from our experiments.</p>
      <sec id="sec-9-1">
        <title>B.1. Seed Stability</title>
<p>Public LB scores across submissions with different seeds: 0.635002, 0.610959, 0.630144, 0.620723,
0.598396, 0.626281, 0.605735, 0.613994, 0.611792, 0.634171, 0.629907, 0.590896, 0.627677, 0.631077,
0.627975, 0.650119, 0.626682, 0.633204, 0.636626, 0.629955, 0.644002, 0.659341, 0.665605, 0.675107,
0.683710, 0.657769, 0.659733, 0.648198, 0.659187, 0.66703.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF 2024:
          <article-title>Acoustic identification of under-studied bird species in the western ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <source>Overview of LifeCLEF</source>
          <year>2024</year>
          :
          <article-title>Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , E. Witting, H. de Heer, C. T. Kopar,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sándor</surname>
          </string-name>
          , Identifying Bird Calls,
          <year>2024</year>
          . URL: https://github.com/TeamEpochGithub/iv-q4-birdclef-2024.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , E. Witting, H. de Heer, C. T. Kopar,
          <string-name>
            <surname>J. van Selm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ebersberger</surname>
          </string-name>
          , G. Dumont,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sándor</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. De Dios Allegue</surname>
          </string-name>
          , Epochalyst,
          <year>2024</year>
          . URL: https://github.com/TeamEpochGithub/epochalyst.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , E. Witting, H. de Heer, C. T. Kopar,
          <string-name>
            <surname>J. van Selm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ebersberger</surname>
          </string-name>
          , G. Dumont,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sándor</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. De Dios Allegue</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://teamepoch.ai/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ronacher</surname>
          </string-name>
          ,
          <source>Rye: a Hassle-Free Python Experience</source>
          ,
          <year>2024</year>
          . URL: https://rye.astral.sh/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Dask</surname>
            <given-names>core developers</given-names>
          </string-name>
          ,
          <source>Dask | Scale the Python tools you love</source>
          ,
          <year>2024</year>
          . URL: https://www.dask.org/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McVicar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Faronbi</surname>
          </string-name>
          , I. Roman,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seyfarth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lostanlen</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Niekirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cwitkowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zalkow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Halvachs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Robert-Stöter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bittner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weiss</surname>
          </string-name>
          , E. Battenberg,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yamamoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Metsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Friesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hidaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kowalik</surname>
          </string-name>
          , F. Keller, D. Mazur,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chabot-Leclerc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawthorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ramaprasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Keum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Monroe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Morozov</surname>
          </string-name>
          , K. Eliasi, nullmightybofo, P. Biberstein,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Sergin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hennequin</surname>
          </string-name>
          , R. Naktinis, beantowel, T. Kim,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Åsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hereñú</surname>
          </string-name>
          , S. van der Struijk, L. Nickel,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vollrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarrof</surname>
          </string-name>
          ,
          <string-name>
            <surname>Xiao-Ming</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kranzler</surname>
            , Voodoohop,
            <given-names>M. D.</given-names>
          </string-name>
          <string-name>
            <surname>Gangi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Jinoz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Guerrero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mazhar</surname>
            , toddrme2178,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Baratz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kostin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>C. T.</given-names>
          </string-name>
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Campr</surname>
            , E. Semeniuc,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Biswal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Moura</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brossier</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          , W. Pimenta, librosa/librosa: 0.10.2.post1,
          <year>2024</year>
          . URL: https://doi.org/10.5281/zenodo.11192913. doi:10.5281/zenodo.11192913.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>