=Paper=
{{Paper
|id=Vol-3740/paper-199
|storemode=property
|title=Improving Bird Recognition using Pseudo-Labeled Recordings from the Target Location
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-199.pdf
|volume=Vol-3740
|authors=Mario Lasseck
|dblpUrl=https://dblp.org/rec/conf/clef/Lasseck24
}}
==Improving Bird Recognition using Pseudo-Labeled Recordings from the Target Location==
Mario Lasseck
Museum für Naturkunde Berlin, Germany
Abstract
This paper presents a deep learning approach to identify bird species in soundscape recordings
with Convolutional Neural Networks (CNNs). The proposed method employs an iterative
process to create pseudo labels for a large number of unlabeled recordings from the target
location and applies them during training to significantly improve model performance and
address the domain shift between training and test data. The effectiveness of the approach is
evaluated in the BirdCLEF 2024 competition hosted on Kaggle, where it achieves a macro-
averaged area under the ROC curve (AUC) of 69 % on the official test set. This performance
positions the method among the top two systems for identifying birds in wildlife monitoring
recordings of the Western Ghats, a major biodiversity hotspot in India.
Keywords
Bird Species Recognition, Biodiversity Assessment, Soundscapes, BirdCLEF, Deep Learning,
Domain Adaptation, Pseudo-Labeling, Semi-Supervised Learning, Kaggle Competition
1. Introduction
The BirdCLEF 2024 competition focuses on developing automated systems for detecting and
classifying under-studied bird species in the Western Ghats. This mountain range, a global biodiversity
hotspot in India, hosts a variety of endemic and endangered species, including many found nowhere
else in the world. As the region faces drastic landscape and climatic changes, there is an urgent need for
advanced conservation tools to assess and monitor its unique birdlife. The challenge aims to identify
native species of the Western Ghats sky-islands, classify rare birds with limited training data and detect
elusive nocturnal species. This year's edition introduces several challenges and unique aspects:
• Participants must address a significant domain shift between the training data, which
consists of focal recordings from various locations, and the test data, which comprises
soundscapes from the Western Ghats.
• The competition imposes a strict time limit for species identification in the test set, adding
a practical constraint that mirrors real-world applications to assess and monitor biodiversity.
• To aid in bridging the domain gap, an additional unlabeled dataset from the target location
is provided, allowing participants to explore un- and semi-supervised learning techniques.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
EMAIL: Mario.Lasseck@mfn.berlin
©️ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
By improving the accuracy and efficiency of bird identification algorithms under these constraints,
this initiative supports ongoing conservation efforts, such as those led by V. V. Robin's Lab at IISER
Tirupati [1]. These innovations will empower researchers and practitioners to more effectively track
avian population trends, evaluate threats and refine their conservation strategies in this ecologically
crucial region.
Further details about the BirdCLEF 2024 competition are given in [2], [3] and [4]. The task is part
of the LifeCLEF 2024 evaluation campaign [5,6] and the Conference and Labs of the Evaluation Forum
[7,8].
2. Materials and Methods
The implementation of the machine learning based system for bird species recognition presented in
this paper builds upon solutions for previous BirdCLEF competitions and similar tasks [9,10,11,12,13].
Further details on the author's own past developments and implementation methods can be found, for example, in [14], [15], [16] and [17].
2.1. Datasets
The BirdCLEF 2024 training data consists of 24459 audio recordings provided by Xeno-canto [18],
covering 182 different bird species. Unique to this year’s task, an additional 8444 unlabeled recordings
are provided from the same location as the test set soundscapes. Table 1 provides an overview of the
individual datasets and their characteristics. All recordings are resampled to 32 kHz, converted to mono,
and compressed to Ogg format.
Xeno-canto files are weakly labeled, meaning there is no precise information on the presence or
absence of the labeled bird within the recording. However, there is a high probability of hearing the
labeled bird at the beginning of each audio file, as recordists often trim their recordings accordingly
before uploading them. To exploit this characteristic, only the first 5 seconds of recordings are used for
training. For some recordings, one or more background species are also provided as secondary labels.
For cross-validation, the training dataset is split into 5 or 8 stratified randomized folds, ensuring that
primary species are proportionally represented in each fold.
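The stratified split described above can be sketched as follows. This is a minimal stdlib version under the assumption that stratification is done per primary label; the function name and round-robin dealing are illustrative, not the published implementation:

```python
import random
from collections import defaultdict

def make_folds(primary_labels, n_folds=5, seed=42):
    """Assign each recording a fold index, stratified by primary species:
    within each class, recordings are shuffled and dealt round-robin so
    every fold gets a proportional share of that class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(primary_labels):
        by_class[label].append(idx)
    folds = [0] * len(primary_labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[idx] = pos % n_folds
    return folds
```

With 5 folds, a class with only 5 recordings (the minimum in the training set) contributes exactly one recording to each fold.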
Table 1: Datasets overview and statistics

|                        | Training set                   | Unlabeled set        | Test set       |
| Recording type         | Focal                          | Soundscape           | Soundscape     |
| Source                 | Various locations (Xeno-canto) | Western Ghats        | Western Ghats  |
| # Recordings           | 24459                          | 8444                 | 1100           |
| Min. duration per rec. | 0.47 s                         | 20 s                 | 4 min          |
| Max. duration per rec. | 1 h 39 min 24 s                | 4 min                | 4 min          |
| Acc. duration all rec. | 11 d 20 h 50 min 30 s          | 23 d 6 h 19 min 11 s | 3 d 1 h 20 min |
| # Species / classes    | 182                            | unknown              | unknown        |
| Min. # rec. per class  | 5                              | unknown              | unknown        |
| Max. # rec. per class  | 500                            | unknown              | unknown        |
2.2. Feature Engineering
The public notebook [19] of Salman Ahmed [20] was used as a baseline for feature engineering and
early model training, following discussions on the Kaggle forum [21] initiated by lihaoweicvch [22].
All models are trained on 5-second audio chunks represented as spectrograms. The raw 1D audio signal
is converted to a 2D log Mel spectrogram image using the MelSpectrogram [23] and AmplitudeToDB
[24] classes from the torchaudio.transforms library [25].
The baseline system uses:
• First 5 seconds of training files and no extra recordings or classes from other sources
• Model input: resized 3 channel Mel spectrogram images of size 256x256 pixel
• CNN backbone: eca_nfnet_l0 [26] pretrained on ImageNet [27]
• Mel spectrogram parameters:
o n_fft = 2048
o hop_length = 512
o n_mels = 128
o f_min = 20
o f_max = 16000
• Training parameters:
o CosineAnnealingLR scheduler [28] with 5 warmup epochs [29]
o Peak learning rate 1e-4
o 100 epochs with early stopping if AUC is not improving for 7 epochs
o Batch size 64
o Average of binary cross-entropy [30] and focal loss [31] as loss function
o Generalized-Mean (GeM) pooling
• Augmentations:
o HorizontalFlip [32]
o CoarseDropout [33]
o Mixup of Mel spectrogram images within training batches
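The mixup augmentation within training batches can be sketched as follows (a minimal NumPy version). Blending with a Beta-distributed weight is standard mixup; taking the element-wise maximum of the targets is one common multi-label variant, assumed here since the exact target handling is not specified:

```python
import numpy as np

def mixup_batch(images, targets, alpha=0.4, rng=None):
    """Mixup within a training batch: blend each Mel spectrogram image with
    a randomly permuted partner using a Beta(alpha, alpha) weight. For
    multi-label bird targets, the element-wise maximum keeps every species
    audible in the mix labeled as present."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(images))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = np.maximum(targets, targets[perm])
    return mixed_images, mixed_targets
```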
This system achieves a maximum AUC of 66 % on the public test set. From this baseline,
experiments were conducted with different CNN backbones, hyperparameter settings, augmentation
methods and input image sizes. A major drawback of the initial model was its relatively long submission time of over one hour. In addition to improving the score, one objective was to reduce inference time
to fit more models in an ensemble without exceeding the 2-hour submission time limit. To address this,
the CNN backbone was replaced with an EfficientNet B0 architecture (tf_efficientnet_b0_ns [34]) and
the Mel spectrogram image was reduced to smaller dimensions. Results were initially unstable, with public leaderboard scores ranging from 62 % to 66 % AUC, and very sensitive to different combinations of Mel parameters and input image sizes. However, with further adjustments, it was
possible to create single models with an inference time of around 12 minutes, still achieving a score of
approximately 65 % AUC.
Main changes to the initial model included:
• CNN backbone: tf_efficientnet_b0_ns
• 5 dropout layers before the fully connected classification layer (inspired by models of
BirdCLEF2021 2nd [35] and BirdCLEF2023 4th [36] place solutions)
• Higher learning rate (1e-3), fewer warmup epochs (3) and fewer training epochs (50)
• Different Mel parameters (n_mels, hop_length)
• Additional augmentation: local and global time and frequency stretching performed on Mel
spectrogram images via resizing parts and/or the entire image
• Creating checkpoint soups instead of using early stopping
2.3. Training Methods
The training data is divided into 5 or 8 folds, stratified according to primary labels. Only the first 5
seconds of each audio file are used for training. The models are trained using Convolutional Neural
Network (CNN) backbones, specifically tf_efficientnet_b0_ns, which are pretrained on ImageNet. The
training process employs the AdamW [37] optimizer and a one-cycle CosineAnnealingLR scheduler
with a peak learning rate of 1e-3 and 3 warmup epochs. The average of binary cross-entropy and focal
loss is used to optimize model performance.
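The combined loss could look like the following sketch. The alpha class weighting used by torchvision's sigmoid_focal_loss is omitted here for brevity, so the focal term differs from the library default; the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, gamma=2.0):
    """Average of binary cross-entropy and sigmoid focal loss, both
    computed on raw logits (alpha weighting omitted)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    focal = bce * (1.0 - p_t) ** gamma  # down-weight easy examples
    return (bce.mean() + focal.mean()) / 2.0
```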
For validation, the first 5 seconds of the files in the validation set are used to track learning progress
through evaluation metrics Label Ranking Average Precision (LRAP) [38], cMAP [39], F1 [40] and
AUC [41]. Background species are included with a target value of 1.0 and are treated equally to primary
labeled species.
To enhance model stability and performance, "checkpoint soups" are used for single model
inference. This follows the idea of model soups [42], but here weights from different checkpoints of the same model (typically from epochs 13-50) are averaged, and a checkpoint is only added if it improves the local cross-validation score in at least one of the tracked metrics. This approach leads to more stable and
occasionally better performance. For ensemble inference, predictions from several models are
combined using simple mean averaging.
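The greedy checkpoint-soup procedure can be sketched as follows. Parameters are shown as plain float lists instead of tensors, and `score_fn` returning a single number is a simplification of the multi-metric check described above:

```python
def average_weights(state_dicts):
    """Element-wise mean of parameter lists (tensors in a real setup)."""
    n = len(state_dicts)
    return {key: [sum(vals) / n for vals in zip(*(sd[key] for sd in state_dicts))]
            for key in state_dicts[0]}

def greedy_checkpoint_soup(checkpoints, score_fn):
    """Build a 'checkpoint soup': starting from the first checkpoint, add
    each further checkpoint to the average only if the resulting averaged
    weights improve the validation score."""
    members = [checkpoints[0]]
    best = score_fn(average_weights(members))
    for ckpt in checkpoints[1:]:
        candidate = average_weights(members + [ckpt])
        score = score_fn(candidate)
        if score > best:
            members.append(ckpt)
            best = score
    return average_weights(members)
```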
The above-described modifications to the baseline model allowed the creation of an ensemble of six
models, achieving 70 % AUC. This ensemble was subsequently used to generate a first set of pseudo
labels.
Performance Improvement with Pseudo Labels
Pseudo labels are created by applying the model ensemble on the unlabeled recordings from the test
location. The predictions from all 5-second intervals of the 8444 unlabeled soundscapes form a large
set of 401947 soft pseudo labels.
In the subsequent training stages, randomly selected audio segments from the pseudo-labeled
recordings are mixed with the training samples at a probability of 25 to 45 percent. Before combining
the audio signals, the amplitudes of both waveforms are multiplied by a random factor. The target vector
of the training sample (with a value of 1.0 for primary and secondary species and 0 for others) is
combined with the pseudo label vector (containing predicted probabilities) to form the new target vector
by taking the maximum value of both.
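The mixing step can be sketched like this (illustrative names; the `chance` and amplitude exponent ranges correspond to the per-model parameters pseudoLabelChance, ampExpMin and ampExpMax in Table 4):

```python
import numpy as np

def mix_with_pseudo(train_wave, train_target, pseudo_wave, pseudo_target,
                    chance=0.35, amp_exp_range=(-0.5, 0.1), rng=None):
    """With probability `chance`, mix a training waveform with a
    pseudo-labeled segment from the target location. Both signals are
    scaled by a random amplitude factor 10**U(ampExpMin, ampExpMax);
    the new target is the element-wise maximum of the hard training
    target and the soft pseudo label vector."""
    rng = rng or np.random.default_rng(0)
    if rng.random() >= chance:
        return train_wave, train_target
    lo, hi = amp_exp_range
    mixed_wave = (train_wave * 10.0 ** rng.uniform(lo, hi)
                  + pseudo_wave * 10.0 ** rng.uniform(lo, hi))
    mixed_target = np.maximum(train_target, pseudo_target)
    return mixed_wave, mixed_target
```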
Incorporating pseudo labels into training significantly improved scores for both single models and
ensembles. The enhanced ensemble was then used to generate a new set of pseudo labels and this cycle
was repeated multiple times to progressively improve model and ensemble performance. The iterative
pseudo-labeling process is described in Figure 1. Its impact on public and private leaderboard scores is
illustrated in Table 2 and visualized in Figure 2.
Figure 1: Iterative pseudo-labeling process to improve single model and ensemble performance
Table 2: Performance improvement using pseudo labels from different training stages

| Stage | Pseudo labels                  | Single model (ID 4)      | Ensemble                 |
|       |                                | publ. / priv. LB AUC [%] | publ. / priv. LB AUC [%] |
| 0     | -                              | 65.735 / 59.270          | 70.065 / 61.738          |
| 1     | from stage 0 ensemble          | 69.165 / 66.119          | 71.090 / 67.084          |
| 2     | from stage 1 ensemble          | 69.936 / 67.445          | 72.528 / 69.035          |
| 3     | from stage 2 ens. (normalized) | 71.154 / 67.683          | 71.716 / 69.527          |
After the second iteration, pseudo label values became too large and required normalization by rescaling them back to the range [0, 1] to allow stable model training. Unfortunately, the stage 3 ensemble was not selected for final ranking because its public leaderboard score did not show the expected improvement.
Figure 2: Visualization of performance improvement using pseudo labels from different training stages
Post-Processing
Models are ensembled by simply taking the mean of predictions (probabilities from sigmoid outputs)
of each individual model. As a final step, for each test file, predictions of a given time window are
summed with those of the two neighboring windows using an aggregation factor of 0.5. This post-
processing method was previously applied by Theo Viel and his team in the 3rd place solution [43] of
the Cornell Birdcall Identification competition [44].
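The neighbor aggregation can be sketched as follows (the function name is illustrative; the aggregation factor of 0.5 is taken from the text above):

```python
import numpy as np

def smooth_predictions(preds, factor=0.5):
    """Add factor-weighted predictions of the two neighboring time windows
    to each window (edge windows have only one neighbor).
    preds: array of shape (n_windows, n_species)."""
    smoothed = preds.copy()
    smoothed[1:] += factor * preds[:-1]   # previous window
    smoothed[:-1] += factor * preds[1:]   # next window
    return smoothed
```

This exploits the fact that a bird audible in one 5-second window is often also audible in the adjacent ones.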
Inference Optimizations
To speed up inference, audio files from the test set are preprocessed in parallel using multithreading.
Additionally, different versions of Mel spectrogram images are pre-calculated and reused for different
models in the ensemble. By including models that work on smaller image sizes, ensembles of up to six
models can run within the 2-hour limit to create predictions for all 1100 recordings in the test set.
Due to variations in the hardware provided by Kaggle for running inference notebooks, particularly
in CPU types, the number of models that could be ensembled to identify all birds in the test set within
the given time frame varied. To prevent submission errors, a timer is implemented in the notebook to
ensure completion within the 2-hour limit. If the timer reaches approximately 118 minutes, inference is
stopped and results are collected for all models and predicted file parts up to that point. Predictions
from unfinished models or file parts are masked before averaging.
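The timer-and-masking logic might look like the following simplified sketch; each hypothetical model here maps a chunk to a single probability rather than a per-species vector, and the function name is illustrative:

```python
import time
import numpy as np

TIME_LIMIT_S = 118 * 60  # stop safely before the 2-hour submission limit

def timed_ensemble_inference(models, chunks, start_time=None):
    """Run ensemble inference until the timer expires; predictions from
    unfinished (model, chunk) pairs are masked out before averaging."""
    start = time.monotonic() if start_time is None else start_time
    preds = np.zeros((len(models), len(chunks)))
    mask = np.zeros((len(models), len(chunks)), dtype=bool)
    out_of_time = False
    for m, model in enumerate(models):
        for c, chunk in enumerate(chunks):
            if time.monotonic() - start > TIME_LIMIT_S:
                out_of_time = True
                break
            preds[m, c] = model(chunk)
            mask[m, c] = True
        if out_of_time:
            break
    counts = np.maximum(mask.sum(axis=0), 1)  # avoid division by zero
    return (preds * mask).sum(axis=0) / counts
```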
3. Results
The training and pseudo-labeling approach described in this paper secured 2nd place among a total
of 974 participating teams. Final scores on the public and private leaderboards, as well as the ranking
of the top 10 teams, are presented in Table 3. By combining several diverse models, a macro-averaged
ROC-AUC of 69.035 % was achieved on the complete test set (see team 'adsr' in Table 3).
Table 3: Competition results of the top 10 teams (with the solution of team 'adsr' described in this paper)

| Rank | Team name on Kaggle | AUC [%] (publ. LB) | AUC [%] (priv. LB) |
| 1    | Team Kefir          | 73.857             | 69.039             |
| 2    | adsr                | 72.794             | 69.035             |
| 3    | NVBird              | 74.212             | 68.997             |
| 4    | Team Cerberus       | 74.691             | 68.777             |
| 5    | coolz               | 74.396             | 68.717             |
| 6    | penguin46           | 72.039             | 68.716             |
| 7    | Team Unicorn        | 72.809             | 68.383             |
| 8    | kapenon             | 69.660             | 67.928             |
| 9    | Aphysict            | 71.453             | 67.891             |
| 10   | Tamo                | 70.132             | 67.623             |
Parameters and performance of the six models from the 2nd place solution (2nd stage ensemble in
Table 2) are detailed in Table 4. Model diversity in the ensemble is achieved by varying Mel parameters,
data subsets, image sizes, the probability of adding pseudo labels and amplitude factors to adjust the
volume ratio between training and pseudo-labeled data. The parameters ampExpMin and ampExpMax
in Table 4 specify the range for the random amplitude factor applied to training and pseudo-label
samples to adjust their volume in the mix:
ampFactor = 10**(random.uniform(ampExpMin, ampExpMax))
Table 4: Single model parameters and performances of the 2nd place ensemble

| Params. / Model ID    | 1       | 2       | 3       | 4       | 5       | 6       |
| seed                  | 42      | 42      | 42      | 42      | 70      | 42      |
| n_folds               | 5       | 5       | 5       | 5       | 10      | 5       |
| fold                  | 4       | 1       | 4       | 4       | 0       | 4       |
| dataset               | bc24    | bc24    | bc24    | bc24    | bc24+   | bc24    |
| n_mels                | 128     | 128     | 128     | 64      | 64      | 64      |
| hop_length            | 512     | 512     | 1024    | 1024    | 1024    | 1024    |
| image_height          | 256     | 256     | 128     | 64      | 64      | 64      |
| image_width           | 256     | 256     | 128     | 128     | 128     | 64      |
| pseudoLabelChance [%] | 35      | 40      | 45      | 30      | 30      | 25      |
| ampExpMin             | -0.5    | -1.0    | -0.5    | -0.5    | -0.5    | -0.5    |
| ampExpMax             | 0.1     | 0.2     | 0.1     | 0.1     | 0.1     | 0.1     |
| Inference time        | ~50 min | ~50 min | ~17 min | ~12 min | ~12 min | ~11 min |
| Public LB AUC [%]     | 73.270  | 71.975  | 71.104  | 69.936  | 69.124  | 69.309  |
| Private LB AUC [%]    | 68.521  | 68.533  | 68.116  | 67.445  | 64.543  | 65.862  |
Model 5 in Table 4 is the only one utilizing external data. For this model, additional files for the 182
species in the competition were downloaded from Xeno-canto. The first 5 seconds of each file were
added to the training set, with shorter files being padded with zeros to ensure a uniform length.
4. Discussion
As in previous editions of the BirdCLEF competition, the challenge was to use focal recordings from
Xeno-canto to train a system capable of accurately identifying bird species in soundscapes. The
inference time was again limited to 2 hours. However, compared to last year, over twice the amount of
data had to be processed within that time (recordings with a total duration of 1 day, 9 hours and 20
minutes in 2023 vs. 3 days, 1 hour and 20 minutes in 2024). This placed even more constraints on the
size and number of models that could be used to process all recordings in the test set. Other challenges
included the extreme domain shift between training and test data, a significant class imbalance in the
training samples (with some classes having only five example recordings per species) and the lack of
diversity in the training material for many under-studied species in the target location.
Fortunately, a large set of unlabeled soundscapes from the same locations as the test data was
provided this year. With this dataset, it was possible to create pseudo labels and find an effective method
of incorporating them into training to significantly improve identification performance. The approach
described in this paper, using pseudo-labeled data from soundscapes of the deployment location,
combines several advantages:
1. Noise augmentation: By mixing training samples with samples from the target domain, the
model learns how species sound within the environmental background noise of the test site
habitat. This helps to address the domain shift between Xeno-canto recordings and test
soundscapes.
2. Training data extension: The model receives more training samples representing the noise
characteristics and species distribution of the deployment location.
3. Knowledge distillation: Since pseudo labels are derived from predictions of a stronger
model (or ensemble of models in this case), its knowledge is transferred during training to
the smaller model.
For pseudo-labeling, only ensembles that fit the time limit constraint were used for inference. Using
larger ensembles or including models with stronger backbones (e.g. with a higher number of layers for
feature extraction) would likely lead to better pseudo labels. It would be interesting to investigate in
future experiments how much further scores can be improved if stronger pseudo labels are incorporated
during training.
With only two of the best models from the 2nd place system (models 1 and 2 in Table 4), it is possible
to achieve a private leaderboard score of 69.694 % AUC. The combination of these two models takes
much less time for inference compared to using all six models. It surpasses the score of the entire
ensemble and even the 1st place system of the competition (69.039 % AUC). Another interesting
finding is that, combined with pseudo-label training, the SED architecture with attention on frequency
bands from last year [16] achieves the best single model score (69.701 % AUC on private leaderboard).
This again proves that the feature engineering, network architecture, augmentation techniques and
training methods of the BirdCLEF 2023 3rd place system [45] are quite robust and work well for the
data and species sets of this year’s task.
A customized version of the model to identify European bird species is available on GitHub [46]. It
was successfully implemented in a number of tools and projects to assess and monitor avian biodiversity
[47,48,49,50,51,52] and is also part of Naturblick [53], a smartphone application to discover and learn
about nature in urban surroundings.
5. Acknowledgements
I would like to thank Stefan Kahl, Holger Klinck, Maggie, Sohier Dane, Tom Denton, Vijay Ramesh,
Maximilian Eibl, Chiti Arvind, Harikrishnan C.P., Viral Joshi, V.V. Robin, Suyash Sawant, Alexis Joly,
Henning Müller, Divya Mudappa, T.R. Shankar Raman, Meghana Srivathsa, Akshay V. Anand,
Willem-Pier Vellinga and all involved institutions and individual contributors (Kaggle, Chemnitz
University of Technology, Columbia University, Google Research, Indian Institute of Science
Education and Research Tirupati, K. Lisa Yang Center for Conservation Bioacoustics, LifeCLEF,
Nature Conservation Foundation, Parry Agro Industries Ltd., Project Dhvani, Tamil Nadu Forest
Department, Tata Coffee Ltd., Tea Estates India Ltd., The Rufford Foundation, The University of
Florida and Xeno-canto) for organizing this competition.
I also want to thank the Museum für Naturkunde and the team of the Animal Sound Archive Berlin
[54] in particular Karl-Heinz Frommolt, Olaf Jahn and Benjamin Werner for supporting my work. The
research was partly funded by the BMEL (Bundesministerium für Ernährung und Landwirtschaft)
within the project “Machbarkeitsstudie - Integration (bio-)akustischer Methoden zur Quantifizierung
biologischer Vielfalt in das Waldmonitoring” (FKZ: 2221NR050B).
6. References
[1] https://www.skyisland.in/
[2] Klinck H, Maggie, Dane S, Kahl S, Denton T, Ramesh V (2024) BirdCLEF 2024. Kaggle.
https://kaggle.com/competitions/birdclef-2024
[3] https://www.imageclef.org/node/316
[4] Kahl S, Denton T, Klinck H, Ramesh V, Joshi V, Srivathsa M, Anand A, Arvind C, Harikrishnan CP,
Sawant S, Robin VV, Glotin H, Goëau H, Vellinga WP, Planqué R, Joly A (2024) Overview of
BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats. In: Working
Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum
[5] https://www.imageclef.org/LifeCLEF2024
[6] Joly A, Picek L, Kahl S, Goëau H, Espitalier V, Botella C, Deneu B, Marcos D, Estopinan J, Leblanc C,
Larcher T, Šulc M, Hrúz M, Servajean M et al. (2024) Overview of lifeclef 2024: Challenges on Species
Distribution Prediction and Identification. In: International Conference of the Cross-Language
Evaluation Forum for European Languages, Springer, 2024
[7] Faggioli G, Ferro N, Galuščáková P, García Seco de Herrera A (Ed.) (2024) Working Notes of CLEF
2024 - Conference and Labs of the Evaluation Forum
[8] Goeuriot L, Mulhem P, Quénot G, Schwab D, Soulier L, Di Nunzio GM, Galuščáková P, García Seco
de Herrera A, Faggioli G, Ferro N (Ed.) (2024) Experimental IR Meets Multilinguality, Multimodality,
and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF
2024)
[9] Sprengel E, Jaggi M, Kilcher Y, Hofmann T (2016) Audio based bird species identification using deep
learning techniques. In: CEUR Workshop Proceedings.
[10] Kahl S, Wilhelm-Stein T, Hussein H et al. (2017) Large-Scale Bird Sound Classification using
Convolutional Neural Networks. In: CEUR Workshop Proceedings.
[11] Grill T, Schlüter J (2017) Two Convolutional Neural Networks for Bird Detection in Audio Signals. In:
25th European Signal Processing Conference (EUSIPCO2017). Kos, Greece.
https://doi.org/10.23919/EUSIPCO.2017.8081512
[12] Sevilla A, Glotin H (2017) Audio bird classification with inception-v4 extended with time and time-
frequency attention mechanisms. In: CEUR Workshop Proceedings.
[13] Stowell D, Stylianou Y, Wood M, Pamuła H, Glotin H (2018) Automatic acoustic detection of birds
through deep learning: the first Bird Audio Detection challenge. In: Methods in Ecology and Evolution
[14] Lasseck M (2018) Audio-based Bird Species Identification with Deep Convolutional Neural Networks.
In: CEUR Workshop Proceedings.
[15] Lasseck M (2018) Acoustic Bird Detection with Deep Convolutional Neural Networks. In: Plumbley
MD et al. (eds) Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018
Workshop (DCASE2018), pp. 143-147, Tampere University of Technology.
[16] Lasseck M (2019) Bird Species Identification in Soundscapes. In: CEUR Workshop Proceedings.
[17] Lasseck M (2023) Bird Species Recognition using Convolutional Neural Networks with Attention on
Frequency Bands. In: CEUR Workshop Proceedings.
[18] https://xeno-canto.org/
[19] https://www.kaggle.com/code/salmanahmedtamu/training-0-65-0-66
[20] https://www.kaggle.com/salmanahmedtamu
[21] https://www.kaggle.com/competitions/birdclef-2024/discussion/497539
[22] https://www.kaggle.com/lihaoweicvch
[23] https://pytorch.org/audio/main/generated/torchaudio.transforms.MelSpectrogram.html
[24] https://pytorch.org/audio/main/generated/torchaudio.transforms.AmplitudeToDB.html
[25] https://pytorch.org/audio/main/transforms.html
[26] https://huggingface.co/timm/eca_nfnet_l0
Deng J et al. (2009) ImageNet: A large-scale hierarchical image database. In: IEEE Conference on
Computer Vision and Pattern Recognition, 2009. pp. 248–255
[28] https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html
[29] https://github.com/ildoonet/pytorch-gradual-warmup-lr
[30] https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
[31] https://pytorch.org/vision/main/generated/torchvision.ops.sigmoid_focal_loss.html
[32] https://albumentations.ai/docs/api_reference/augmentations/geometric/transforms/
[33] https://albumentations.ai/docs/api_reference/augmentations/dropout/coarse_dropout/
[34] https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/efficientnet.py
[35] https://www.kaggle.com/competitions/birdclef-2021/discussion/243463
[36] https://www.kaggle.com/competitions/birdclef-2023/discussion/412753
[37] https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.label_ranking_average_precision_score.html
[39] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
[40] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
[41] https://www.kaggle.com/code/metric/birdclef-roc-auc
Wortsman M et al. (2022) Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time, arXiv:2203.05482
[43] https://www.kaggle.com/competitions/birdsong-recognition/discussion/183199
[44] https://www.kaggle.com/competitions/birdsong-recognition
[45] https://www.kaggle.com/competitions/birdclef-2023/discussion/414102
[46] https://github.com/adsr71/BirdID-Europe254
[47] Stehle M, Lasseck M, Khorramshahi O, Sturm U (2020) Evaluation of acoustic pattern recognition of
nightingale (Luscinia megarhynchos) recordings by citizens. In: Research Ideas and Outcomes 6:
e50233. doi: 10.3897/rio.6.e50233
[48] Wägele JW, Bodesheim P, Bourlat SJ, Denzler J et al. (2022) Towards a multisensor station for
automated biodiversity monitoring. In: Basic and Applied Ecology (59), 105-138. doi:
10.1016/j.baae.2022.01.003
[49] Wägele JW, Tschan GF et al. (2024) Weather stations for biodiversity: a comprehensive approach to an
automated and modular monitoring system. Advanced Books, Pensoft, Sofia, 1-218.
https://doi.org/10.3897/ab.e119534
[50] https://www.idmt.fraunhofer.de/en/institute/projects-products/projects/devise.html
[51] https://www.museumfuernaturkunde.berlin/en/science/acoustic-forest-monitoring
https://www.thuenen.de/en/fachinstitute/waldoekosysteme/querschnittsgruppen/naturschutz/projekte/integration-bio-akustischer-methoden-fuer-die-quantifizierung-biologischer-vielfalt-in-das-waldmonitoring-akwamo-1-2
[53] https://naturblick.museumfuernaturkunde.berlin/?lang=en
[54] https://www.museumfuernaturkunde.berlin/en/science/animal-sound-archive