<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improving Bird Recognition using Pseudo-Labeled Recordings from the Target Location</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Mario</forename><surname>Lasseck</surname></persName>
							<email>mario.lasseck@mfn.berlin</email>
							<affiliation key="aff0">
								<orgName type="institution">Museum für Naturkunde Berlin</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Improving Bird Recognition using Pseudo-Labeled Recordings from the Target Location</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6F03A66A27DF775EFE57530419C04C71</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Bird Species Recognition</term>
					<term>Biodiversity Assessment</term>
					<term>Soundscapes</term>
					<term>BirdCLEF</term>
					<term>Deep Learning</term>
					<term>Domain Adaptation</term>
					<term>Pseudo-Labeling</term>
					<term>Semi-Supervised Learning</term>
					<term>Kaggle Competition</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents a deep learning approach to identifying bird species in soundscape recordings with Convolutional Neural Networks (CNNs). The proposed method employs an iterative process to create pseudo labels for a large number of unlabeled recordings from the target location and applies them during training to significantly improve model performance and address the domain shift between training and test data. The effectiveness of the approach is evaluated in the BirdCLEF 2024 competition hosted on Kaggle, where it achieves a macro-averaged area under the ROC curve (AUC) of 69% on the official test set. This performance places the method among the top two systems for identifying birds in wildlife monitoring recordings from the Western Ghats, a major biodiversity hotspot in India.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The BirdCLEF 2024 competition focuses on developing automated systems for detecting and classifying under-studied bird species in the Western Ghats. This mountain range, a global biodiversity hotspot in India, hosts a variety of endemic and endangered species, including many found nowhere else in the world. As the region faces drastic landscape and climatic changes, there is an urgent need for advanced conservation tools to assess and monitor its unique birdlife. The challenge aims to identify native species of the Western Ghats sky-islands, classify rare birds with limited training data and detect elusive nocturnal species. This year's edition introduces several challenges and unique aspects:</p><p>• Participants must address a significant domain shift between the training data, which consists of focal recordings from various locations, and the test data, which comprises soundscapes from the Western Ghats.
• The competition imposes a strict time limit for species identification in the test set, adding a practical constraint that mirrors real-world applications for assessing and monitoring biodiversity.
• To aid in bridging the domain gap, an additional unlabeled dataset from the target location is provided, allowing participants to explore unsupervised and semi-supervised learning techniques.</p><p>By improving the accuracy and efficiency of bird identification algorithms under these constraints, this initiative supports ongoing conservation efforts, such as those led by V. V. Robin's Lab at IISER Tirupati <ref type="bibr">[1]</ref>. These innovations will empower researchers and practitioners to more effectively track avian population trends, evaluate threats and refine their conservation strategies in this ecologically crucial region.</p><p>Further details about the BirdCLEF 2024 competition are given in <ref type="bibr" target="#b0">[2]</ref>, [3] and <ref type="bibr" target="#b1">[4]</ref>. 
The task is part of the LifeCLEF 2024 evaluation campaign <ref type="bibr">[5,</ref><ref type="bibr" target="#b2">6]</ref> and the Conference and Labs of the Evaluation Forum <ref type="bibr" target="#b3">[7,</ref><ref type="bibr" target="#b4">8]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Materials and Methods</head><p>The implementation of the machine learning based system for bird species recognition presented in this paper builds upon solutions for previous BirdCLEF competitions and similar tasks <ref type="bibr" target="#b5">[9,</ref><ref type="bibr" target="#b6">10,</ref><ref type="bibr" target="#b7">11,</ref><ref type="bibr" target="#b9">12,</ref><ref type="bibr" target="#b10">13]</ref>. Further details on the author's past developments and implementation methods can be found, for example, in <ref type="bibr" target="#b11">[14]</ref>, <ref type="bibr" target="#b12">[15]</ref>, <ref type="bibr" target="#b13">[16]</ref> and <ref type="bibr" target="#b14">[17]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Datasets</head><p>The BirdCLEF 2024 training data consists of 24459 audio recordings provided by Xeno-canto [18], covering 182 different bird species. Unique to this year's task, an additional 8444 unlabeled recordings are provided from the same location as the test set soundscapes. Table <ref type="table" target="#tab_0">1</ref> provides an overview of the individual datasets and their characteristics. All recordings are resampled to 32 kHz, converted to mono, and compressed to Ogg format.</p><p>Xeno-canto files are weakly labeled, meaning there is no precise information on the presence or absence of the labeled bird within the recording. However, there is a high probability of hearing the labeled bird at the beginning of each audio file, as recordists often trim their recordings accordingly before uploading them. To exploit this characteristic, only the first 5 seconds of each recording are used for training. For some recordings, one or more background species are also provided as secondary labels.</p><p>For cross-validation, the training dataset is split into 5 or 8 stratified randomized folds, ensuring that primary species are proportionally represented in each fold.</p><p>This baseline system achieves a maximum AUC of 66% on the public test set. From this baseline, experiments were conducted with different CNN backbones, hyperparameter settings, augmentation methods and input image sizes. A major drawback of the initial model was its relatively long submission time of over one hour. In addition to improving the score, one objective was to reduce inference time in order to fit more models into an ensemble without exceeding the 2-hour submission time limit. To address this, the CNN backbone was replaced with an EfficientNet B0 architecture (tf_efficientnet_b0_ns <ref type="bibr">[34]</ref>) and the Mel spectrogram image was reduced to smaller dimensions. 
Results were initially unstable, with public leaderboard scores ranging from 62% to 66% AUC, and very sensitive to different combinations of Mel parameters and input image sizes. However, with further adjustments, it was possible to create single models with an inference time of around 12 minutes that still achieve a score of approximately 65% AUC.</p></div>
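The stratified randomized split described above can be sketched as follows. This is an illustrative reconstruction, not the original competition code; the function name and the round-robin assignment are assumptions:

```python
import numpy as np

def stratified_folds(primary_labels, n_folds=5, seed=42):
    """Assign each recording to a fold so that every primary species is
    spread as evenly as possible across the folds (a sketch of the
    stratified randomized split, not the original competition code)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(primary_labels)
    folds = np.empty(len(labels), dtype=int)
    for species in np.unique(labels):
        idx = np.flatnonzero(labels == species)
        rng.shuffle(idx)
        # round-robin assignment keeps per-class proportions balanced,
        # even for rare species with only a handful of recordings
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds
```

With this scheme, a species with only five recordings still contributes one recording to five different folds, so every fold sees every class.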
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Main changes to the initial model included:</head><p>• CNN backbone: tf_efficientnet_b0_ns </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Training Methods</head><p>The training data is divided into 5 or 8 folds, stratified according to primary labels. Only the first 5 seconds of each audio file are used for training. The models are trained using Convolutional Neural Network (CNN) backbones, specifically tf_efficientnet_b0_ns, which are pretrained on ImageNet. The training process employs the AdamW [37] optimizer and a one-cycle CosineAnnealingLR scheduler with a peak learning rate of 1e-3 and 3 warmup epochs. The average of binary cross-entropy and focal loss is used as the loss function.</p><p>For validation, the first 5 seconds of the files in the validation set are used to track learning progress through the evaluation metrics Label Ranking Average Precision (LRAP) <ref type="bibr">[38]</ref>, cMAP [39], F1 [40] and AUC <ref type="bibr">[41]</ref>. Background species are included with a target value of 1.0 and are treated equally to primary labeled species.</p><p>To enhance model stability and performance, "checkpoint soups" are used for single model inference. This follows the idea of model soups <ref type="bibr" target="#b16">[42]</ref>; here, however, weights from different checkpoints of the same model (typically from epochs 13-50) are averaged, provided there is an improvement in local cross-validation scores in at least one of the tracked metrics. This approach leads to more stable and occasionally better performance. For ensemble inference, predictions from several models are combined using simple mean averaging.</p><p>The modifications to the baseline model described above allowed the creation of an ensemble of six models, achieving 70% AUC. This ensemble was subsequently used to generate a first set of pseudo labels.</p></div>
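The greedy checkpoint-averaging scheme described above can be sketched as follows. This is a simplified sketch: the weight dictionaries stand in for PyTorch state dicts, and the `evaluate` callback (standing in for the tracked LRAP/cMAP/F1/AUC metrics) is an assumption:

```python
import numpy as np

def average_weights(weight_dicts):
    """Element-wise mean over a list of {parameter_name: array} weight dicts."""
    return {k: np.mean([w[k] for w in weight_dicts], axis=0)
            for k in weight_dicts[0]}

def checkpoint_soup(checkpoints, evaluate):
    """Greedy "checkpoint soup": add a checkpoint to the soup only if the
    averaged weights improve the validation score (a sketch; `evaluate`
    maps a weight dict to a cross-validation metric)."""
    soup = [checkpoints[0]]
    best = evaluate(average_weights(soup))
    for ckpt in checkpoints[1:]:
        candidate = average_weights(soup + [ckpt])
        score = evaluate(candidate)
        if score >= best:  # keep the checkpoint only on improvement
            soup.append(ckpt)
            best = score
    return average_weights(soup)
```

In the actual system, the same idea is applied to checkpoints of one model from roughly epochs 13-50, and the averaged weights are then used for single-model inference.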
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Performance Improvement with Pseudo Labels</head><p>Pseudo labels are created by applying the model ensemble to the unlabeled recordings from the test location. The predictions from all 5-second intervals of the 8444 unlabeled soundscapes form a large set of 401947 soft pseudo labels.</p><p>In the subsequent training stages, randomly selected audio segments from the pseudo-labeled recordings are mixed with the training samples at a probability of 25 to 45 percent. Before combining the audio signals, the amplitudes of both waveforms are multiplied by a random factor. The target vector of the training sample (with a value of 1.0 for primary and secondary species and 0 for others) is combined with the pseudo label vector (containing predicted probabilities) by taking the element-wise maximum of both to form the new target vector.</p><p>Incorporating pseudo labels into training significantly improved scores for both single models and ensembles. The enhanced ensemble was then used to generate a new set of pseudo labels, and this cycle was repeated multiple times to progressively improve model and ensemble performance. The iterative pseudo-labeling process is described in Figure <ref type="figure" target="#fig_0">1</ref>. Its impact on public and private leaderboard scores is presented in Table <ref type="table" target="#tab_2">2</ref> and visualized in Figure <ref type="figure" target="#fig_1">2</ref>. After the second iteration, pseudo label values became too large and required normalization by rescaling them back to the range [0,1] to allow stable model training. Unfortunately, the stage 3 ensemble was not selected for the final ranking because its public leaderboard score did not show the expected improvement. </p></div>
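The mixing step can be sketched as follows; this is an illustrative reconstruction, with parameter names mirroring pseudoLabelChance, ampExpMin and ampExpMax from Table 4:

```python
import numpy as np

def mix_with_pseudo(audio, target, pseudo_audio, pseudo_target,
                    chance=0.35, amp_exp_min=-0.5, amp_exp_max=0.1,
                    rng=np.random.default_rng(42)):
    """Mix a training sample with a pseudo-labeled segment (a sketch of the
    procedure described above). The new target is the element-wise maximum
    of the hard labels (0/1) and the soft pseudo labels (probabilities)."""
    if rng.random() >= chance:
        return audio, target  # sample left unmixed
    # scale both waveforms by random amplitude factors before mixing,
    # controlling the volume ratio between training and pseudo-labeled data
    amp_train = 10 ** rng.uniform(amp_exp_min, amp_exp_max)
    amp_pseudo = 10 ** rng.uniform(amp_exp_min, amp_exp_max)
    mixed = amp_train * audio + amp_pseudo * pseudo_audio
    return mixed, np.maximum(target, pseudo_target)
```

Taking the maximum rather than, say, the sum keeps hard labels at 1.0 while still letting confident pseudo predictions add species the training sample did not contain.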
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Post-Processing</head><p>Models are ensembled by simply taking the mean of the predictions (probabilities from sigmoid outputs) of each individual model. As a final step, for each test file, the predictions of a given time window are summed with those of the two neighboring windows using an aggregation factor of 0.5. This post-processing method was previously applied by Theo Viel and his team in the 3rd place solution [43] of the Cornell Birdcall Identification competition <ref type="bibr">[44]</ref>.</p></div>
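The neighbor-window aggregation can be sketched as follows; the exact handling of the first and last window (which have only one neighbor) is an assumption:

```python
import numpy as np

def smooth_windows(preds, factor=0.5):
    """Sum each time window's predictions with those of its two neighboring
    windows scaled by `factor` (a sketch of the post-processing step above;
    edge windows use their single neighbor). `preds` has shape
    (n_windows, n_classes)."""
    out = preds.copy()
    out[1:] += factor * preds[:-1]  # add previous window
    out[:-1] += factor * preds[1:]  # add next window
    return out
```

This exploits the fact that bird vocalizations often span window boundaries, so evidence from adjacent 5-second windows supports the current one.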
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inference Optimizations</head><p>To speed up inference, audio files from the test set are preprocessed in parallel using multithreading. Additionally, different versions of Mel spectrogram images are pre-calculated and reused for different models in the ensemble. By including models that work on smaller image sizes, ensembles of up to six models can run within the 2-hour limit to create predictions for all 1100 recordings in the test set.</p><p>Due to variations in the hardware provided by Kaggle for running inference notebooks, particularly in CPU types, the number of models that could be ensembled to identify all birds in the test set within the given time frame varied. To prevent submission errors, a timer is implemented in the notebook to ensure completion within the 2-hour limit. If the timer reaches approximately 118 minutes, inference is stopped and results are collected for all models and predicted file parts up to that point. Predictions from unfinished models or file parts are masked before averaging.</p></div>
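The timer-and-masking scheme can be sketched as follows. This is an illustrative sketch: `models` are callables mapping a file part to a per-class probability vector, and the exact stopping behavior of the original notebook may differ:

```python
import time
import numpy as np

def timed_ensemble_inference(models, file_parts, n_classes,
                             start_time, time_limit_s=118 * 60):
    """Run ensemble inference with a safety timer (a sketch of the scheme
    above; the limit of ~118 minutes leaves a margin below the 2-hour cap).
    Predictions from unfinished models or file parts remain NaN and are
    masked out of the average."""
    preds = np.full((len(models), len(file_parts), n_classes), np.nan)
    for m, model in enumerate(models):
        for p, part in enumerate(file_parts):
            if time.monotonic() - start_time > time_limit_s:
                # stop early; nanmean averages only finished predictions
                return np.nanmean(preds, axis=0)
            preds[m, p] = model(part)
    return np.nanmean(preds, axis=0)
```

Because `np.nanmean` ignores NaN entries, a partially finished model still contributes its completed file parts instead of invalidating the whole submission.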
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head><p>The training and pseudo-labeling approach described in this paper secured 2nd place among a total of 974 participating teams. Final scores on the public and private leaderboards, as well as the ranking of the top 10 teams, are presented in Table <ref type="table" target="#tab_3">3</ref>. By combining several diverse models, a macro-averaged ROC-AUC of 69.035% was achieved on the complete test set (see team 'adsr' in Table <ref type="table" target="#tab_3">3</ref>). Parameters and performance of the six models from the 2nd place solution (2nd stage ensemble in Table <ref type="table" target="#tab_2">2</ref>) are detailed in Table <ref type="table" target="#tab_4">4</ref>. Model diversity in the ensemble is achieved by varying Mel parameters, data subsets, image sizes, the probability of adding pseudo labels and amplitude factors to adjust the volume ratio between training and pseudo-labeled data. The parameters ampExpMin and ampExpMax in Table <ref type="table" target="#tab_4">4</ref> specify the range for the random amplitude factor applied to training and pseudo-label samples to adjust their volume in the mix: ampFactor = 10**(random.uniform(ampExpMin, ampExpMax)). Model 5 in Table <ref type="table" target="#tab_4">4</ref> is the only one utilizing external data. For this model, additional files for the 182 species in the competition were downloaded from Xeno-canto. The first 5 seconds of each file were added to the training set, with shorter files being padded with zeros to ensure a uniform length.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>As in previous editions of the BirdCLEF competition, the challenge was to use focal recordings from Xeno-canto to train a system capable of accurately identifying bird species in soundscapes. The inference time was again limited to 2 hours. However, compared to last year, over twice the amount of data had to be processed within that time (recordings with a total duration of 1 day, 9 hours and 20 minutes in 2023 vs. 3 days, 1 hour and 20 minutes in 2024). This placed even more constraints on the size and number of models that could be used to process all recordings in the test set. Other challenges included the extreme domain shift between training and test data, a significant class imbalance in the training samples (with some classes having only five example recordings per species) and the lack of diversity in the training material for many under-studied species in the target location.</p><p>Fortunately, a large set of unlabeled recordings from the same location as the test data was provided this year. With this dataset, it was possible to create pseudo labels and find an effective method of incorporating them into training to significantly improve identification performance. The approach described in this paper, using pseudo-labeled data from soundscapes of the deployment location, combines several advantages: noise augmentation, training data extension and knowledge distillation. For pseudo-labeling, only ensembles that fit the time limit constraint were used for inference. Using larger ensembles or including models with stronger backbones (e.g. with a higher number of layers for feature extraction) would likely lead to better pseudo labels. 
It would be interesting to investigate in future experiments how much further scores can be improved if stronger pseudo labels are incorporated during training.</p><p>With only two of the best models from the 2nd place system (models 1 and 2 in Table <ref type="table" target="#tab_4">4</ref>), it is possible to achieve a private leaderboard score of 69.694% AUC. The combination of these two models takes much less time for inference compared to using all six models. It surpasses the score of the entire ensemble and even that of the 1st place system of the competition (69.039% AUC). Another interesting finding is that, combined with pseudo-label training, the SED architecture with attention on frequency bands from last year <ref type="bibr" target="#b13">[16]</ref> achieves the best single model score (69.701% AUC on the private leaderboard). This again demonstrates that the feature engineering, network architecture, augmentation techniques and training methods of the BirdCLEF 2023 3rd place system [45] are quite robust and work well for the data and species sets of this year's task.</p><p>A customized version of the model to identify European bird species is available on GitHub <ref type="bibr">[46]</ref>. It was successfully implemented in a number of tools and projects to assess and monitor avian biodiversity <ref type="bibr" target="#b17">[47,</ref><ref type="bibr" target="#b18">48,</ref><ref type="bibr" target="#b19">49,</ref><ref type="bibr">50,</ref><ref type="bibr">51,</ref><ref type="bibr">52]</ref> and is also part of Naturblick [53], a smartphone application to discover and learn about nature in urban surroundings.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Iterative pseudo-labeling process to improve single model and ensemble performance</figDesc><graphic coords="4,135.70,598.90,317.19,104.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Visualization of performance improvement using pseudo labels from different training stages</figDesc><graphic coords="5,85.75,262.32,411.75,186.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Advantages of pseudo-label training</head><figDesc>1. Noise augmentation: By mixing training samples with samples from the target domain, the model learns how species sound within the environmental background noise of the test site habitat. This helps to address the domain shift between Xeno-canto recordings and test soundscapes. 2. Training data extension: The model receives more training samples representing the noise characteristics and species distribution of the deployment location. 3. Knowledge distillation: Since pseudo labels are derived from predictions of a stronger model (or an ensemble of models in this case), its knowledge is transferred during training to the smaller model.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Datasets overview and statistics</figDesc><table><row><cell></cell><cell>Training set</cell><cell>Unlabeled set</cell><cell>Test set</cell></row><row><cell>Recording type</cell><cell>Focal</cell><cell>Soundscape</cell><cell>Soundscape</cell></row><row><cell>Source</cell><cell>Various locations (Xeno-canto)</cell><cell>Western Ghats</cell><cell>Western Ghats</cell></row><row><cell># Recordings</cell><cell>24459</cell><cell>8444</cell><cell>1100</cell></row><row><cell>Min. duration per rec.</cell><cell>0.47s</cell><cell>20s</cell><cell>4m</cell></row><row><cell>Max. duration per rec.</cell><cell>1h 39m 24s</cell><cell>4m</cell><cell>4m</cell></row><row><cell>Acc. duration all rec.</cell><cell>11d 20h 50m 30s</cell><cell cols="2">23d 6h 19m 11s 3d 1h 20m</cell></row><row><cell># Species / Classes</cell><cell>182</cell><cell>unknown</cell><cell>unknown</cell></row><row><cell>Min. # rec. per class</cell><cell>5</cell><cell>unknown</cell><cell>unknown</cell></row><row><cell>Max. # rec. per class</cell><cell>500</cell><cell>unknown</cell><cell>unknown</cell></row></table><note>o CosineAnnealingLR scheduler [28] with 5 warmup epochs [29] o Peak learning rate 1e-4 o 100 epochs with early stopping if AUC is not improving for 7 epochs o Batch size 64 o Average of binary cross-entropy [30] and focal loss [31] as loss function o Generalized-Mean (GeM) pooling • Augmentations: o HorizontalFlip [32] o CoarseDropout [33] o Mixup of Mel spectrogram images within training batches</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Performance improvement using pseudo labels from different training stages</figDesc><table><row><cell cols="2">Stage Pseudo labels</cell><cell>Single model (ID 4)</cell><cell>Ensemble</cell></row><row><cell></cell><cell></cell><cell>publ. | priv. LB AUC [%]</cell><cell>publ. | priv. LB AUC [%]</cell></row><row><cell>0</cell><cell>-</cell><cell>65.735 | 59.270</cell><cell>70.065 | 61.738</cell></row><row><cell>1</cell><cell>From stage 0 ensemble</cell><cell>69.165 | 66.119</cell><cell>71.090 | 67.084</cell></row><row><cell>2</cell><cell>From stage 1 ensemble</cell><cell>69.936 | 67.445</cell><cell>72.528 | 69.035</cell></row><row><cell>3</cell><cell>From stage 2 ens. (normalized)</cell><cell>71.154 | 67.683</cell><cell>71.716 | 69.527</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Competition results of the top 10 teams (with solution of team 'adsr' described in this paper)</figDesc><table><row><cell cols="2">Rank Team Name on Kaggle</cell><cell>AUC [%]</cell><cell>AUC [%]</cell></row><row><cell></cell><cell></cell><cell>(publ. LB)</cell><cell>(priv. LB)</cell></row><row><cell>1</cell><cell>Team Kefir</cell><cell>73.857</cell><cell>69.039</cell></row><row><cell>2</cell><cell>adsr</cell><cell>72.794</cell><cell>69.035</cell></row><row><cell>3</cell><cell>NVBird</cell><cell>74.212</cell><cell>68.997</cell></row><row><cell>4</cell><cell>Team Cerberus</cell><cell>74.691</cell><cell>68.777</cell></row><row><cell>5</cell><cell>coolz</cell><cell>74.396</cell><cell>68.717</cell></row><row><cell>6</cell><cell>penguin46</cell><cell>72.039</cell><cell>68.716</cell></row><row><cell>7</cell><cell>Team Unicorn</cell><cell>72.809</cell><cell>68.383</cell></row><row><cell>8</cell><cell>kapenon</cell><cell>69.660</cell><cell>67.928</cell></row><row><cell>9</cell><cell>Aphysict</cell><cell>71.453</cell><cell>67.891</cell></row><row><cell>10</cell><cell>Tamo</cell><cell>70.132</cell><cell>67.623</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 :</head><label>4</label><figDesc>Single model parameters and performances of the 2nd place ensemble</figDesc><table><row><cell>Params. / Model ID</cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>6</cell></row><row><cell>seed</cell><cell>42</cell><cell>42</cell><cell>42</cell><cell>42</cell><cell>70</cell><cell>42</cell></row><row><cell>n_folds</cell><cell>5</cell><cell>5</cell><cell>5</cell><cell>5</cell><cell>10</cell><cell>5</cell></row><row><cell>fold</cell><cell>4</cell><cell>1</cell><cell>4</cell><cell>4</cell><cell>0</cell><cell>4</cell></row><row><cell>dataset</cell><cell>bc24</cell><cell>bc24</cell><cell>bc24</cell><cell>bc24</cell><cell>bc24+</cell><cell>bc24</cell></row><row><cell>n_mels</cell><cell>128</cell><cell>128</cell><cell>128</cell><cell>64</cell><cell>64</cell><cell>64</cell></row><row><cell>hop_length</cell><cell>512</cell><cell>512</cell><cell>1024</cell><cell>1024</cell><cell>1024</cell><cell>1024</cell></row><row><cell>image_height</cell><cell>256</cell><cell>256</cell><cell>128</cell><cell>64</cell><cell>64</cell><cell>64</cell></row><row><cell>image_width</cell><cell>256</cell><cell>256</cell><cell>128</cell><cell>128</cell><cell>128</cell><cell>64</cell></row><row><cell>pseudoLabelChance [%]</cell><cell>35</cell><cell>40</cell><cell>45</cell><cell>30</cell><cell>30</cell><cell>25</cell></row><row><cell>ampExpMin</cell><cell>-0.5</cell><cell>-1.0</cell><cell>-0.5</cell><cell>-0.5</cell><cell>-0.5</cell><cell>-0.5</cell></row><row><cell>ampExpMax</cell><cell>0.1</cell><cell>0.2</cell><cell>0.1</cell><cell>0.1</cell><cell>0.1</cell><cell>0.1</cell></row><row><cell>Inference time</cell><cell>~ 50 min.</cell><cell>~ 50 min.</cell><cell>~ 17 min.</cell><cell>~ 12 min.</cell><cell>~ 12 min.</cell><cell>~ 11 min.</cell></row><row><cell>Public LB AUC [%]</cell><cell>73.270</cell><cell>71.975</cell><cell>71.104</cell><cell>69.936</cell><cell>69.124</cell><cell>69.309</cell></row><row><cell>Private LB AUC [%]</cell><cell>68.521</cell><cell>68.533</cell><cell>68.116</cell><cell>67.445</cell><cell>64.543</cell><cell>65.862</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Acknowledgements</head><p>I would like to thank Stefan Kahl, Holger Klinck, Maggie, Sohier Dane, Tom Denton, Vijay Ramesh, Maximilian Eibl, Chiti Arvind, Harikrishnan C.P., Viral Joshi, V.V. Robin, Suyash Sawant, Alexis Joly, Henning Müller, Divya Mudappa, T.R. Shankar Raman, Meghana Srivathsa, Akshay V. Anand, Willem-Pier Vellinga and all involved institutions and individual contributors (Kaggle, Chemnitz University of Technology, Columbia University, Google Research, Indian Institute of Science Education and Research Tirupati, K. Lisa Yang Center for Conservation Bioacoustics, LifeCLEF, Nature Conservation Foundation, Parry Agro Industries Ltd., Project Dhvani, Tamil Nadu Forest Department, Tata Coffee Ltd., Tea Estates India Ltd., The Rufford Foundation, The University of Florida and Xeno-canto) for organizing this competition.</p><p>I also want to thank the Museum für Naturkunde and the team of the Animal Sound Archive Berlin [54] in particular Karl-Heinz Frommolt, Olaf Jahn and Benjamin Werner for supporting my work. The research was partly funded by the BMEL (Bundesministerium für Ernährung und Landwirtschaft) within the project "Machbarkeitsstudie -Integration (bio-)akustischer Methoden zur Quantifizierung biologischer Vielfalt in das Waldmonitoring" (FKZ: 2221NR050B).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><surname>Maggie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ramesh</surname></persName>
		</author>
		<ptr target="https://kaggle.com/competitions/birdclef-2024" />
		<title level="m">BirdCLEF</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Kaggle</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Srivathsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Arvind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">P</forename><surname>Harikrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sawant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V</forename><surname>Robin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of lifeclef 2024: Challenges on Species Distribution Prediction and Identification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Espitalier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Botella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Marcos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Estopinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Leblanc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Larcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šulc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hrúz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
		<editor>Galuščáková P, García Seco de Herrera A</editor>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Experimental IR Meets Multilinguality, Multimodality, and Interaction</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Quénot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Di Nunzio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García Seco de Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifteenth International Conference of the CLEF Association</title>
				<meeting>the Fifteenth International Conference of the CLEF Association (CLEF 2024)<address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Audio based bird species identification using deep learning techniques</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sprengel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jaggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kilcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hofmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Large-Scale Bird Sound Classification using Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wilhelm-Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hussein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Two Convolutional Neural Networks for Bird Detection in Audio Signals</title>
		<author>
			<persName><forename type="first">T</forename><surname>Grill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schlüter</surname></persName>
		</author>
		<idno type="DOI">10.23919/EUSIPCO.2017.8081512</idno>
		<ptr target="https://doi.org/10.23919/EUSIPCO.2017.8081512" />
	</analytic>
	<monogr>
		<title level="m">25th European Signal Processing Conference (EUSIPCO 2017)</title>
				<meeting><address><addrLine>Kos, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Audio bird classification with Inception-v4 extended with time and time-frequency attention mechanisms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sevilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stowell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Stylianou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pamuła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Methods in Ecology and Evolution</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Audio-based Bird Species Identification with Deep Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Acoustic Bird Detection with Deep Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Plumbley</surname></persName>
		</editor>
		<meeting>the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="143" to="147" />
		</imprint>
		<respStmt>
			<orgName>Tampere University of Technology</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Bird Species Identification in Soundscapes</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Bird Species Recognition using Convolutional Neural Networks with Attention on Frequency Bands</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">ImageNet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.05482</idno>
		<title level="m">Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Evaluation of acoustic pattern recognition of nightingale (Luscinia megarhynchos) recordings by citizens</title>
		<author>
			<persName><forename type="first">M</forename><surname>Stehle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khorramshahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Sturm</surname></persName>
		</author>
		<idno type="DOI">10.3897/rio.6.e50233</idno>
	</analytic>
	<monogr>
		<title level="j">Research Ideas and Outcomes</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">e50233</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Towards a multisensor station for automated biodiversity monitoring</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Wägele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bodesheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Bourlat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Denzler</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.baae.2022.01.003</idno>
	</analytic>
	<monogr>
		<title level="j">Basic and Applied Ecology</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="105" to="138" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Weather stations for biodiversity: a comprehensive approach to an automated and modular monitoring system</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Wägele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">F</forename><surname>Tschan</surname></persName>
		</author>
		<idno type="DOI">10.3897/ab.e119534</idno>
		<ptr target="https://doi.org/10.3897/ab.e119534" />
	</analytic>
	<monogr>
		<title level="m">Advanced Books</title>
				<meeting><address><addrLine>Sofia</addrLine></address></meeting>
		<imprint>
			<publisher>Pensoft</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1" to="218" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
