<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Deep Active Learning in Avian Bioacoustics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukas Rauch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Huseljic</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moritz Wirth</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Decke</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernhard Sick</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Scholz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IEE</institution>
          ,
          <addr-line>Fraunhofer Institute, Kassel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IES, University of Kassel</institution>
          ,
          <addr-line>Kassel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>12</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>Passive acoustic monitoring (PAM) in avian bioacoustics enables cost-effective and extensive data collection with minimal disruption to natural habitats. Despite advancements in computational avian bioacoustics, deep learning models continue to encounter challenges in adapting to diverse environments in practical PAM scenarios. This is primarily due to the scarcity of annotations, which requires labor-intensive efforts from human experts. Active learning (AL) reduces annotation cost and speeds up adaptation to diverse scenarios by querying the most informative instances for labeling. This paper outlines a deep AL approach, introduces key challenges, and conducts a small-scale pilot study.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Active Learning</kwd>
        <kwd>Avian Bioacoustics</kwd>
        <kwd>Passive Acoustic Monitoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Deep learning (DL) has enhanced bird species recognition from vocalizations in the context of biodiversity monitoring.
Current state-of-the-art (SOTA) approaches such as BirdNET [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Google’s Perch [
        <xref ref-type="bibr" rid="ref1 ref7">7, 1</xref>
        ], and BirdSet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have set benchmarks
in bird sound classification. While initial studies focused on model performance on focal recordings,
research is increasingly shifting towards practical PAM scenarios [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In such environments, autonomous recording units (ARUs) are
proving effective for edge deployment and continuous soundscape analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Research indicates that
pre-trained models facilitate few-shot and transfer learning in data-scarce environments by providing
valuable feature embeddings for rapid prototyping and efficient inference [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. While deep AL is suited
for quick model adaptation, its application in avian bioacoustics is still emerging. Bellafkir et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have
integrated AL into edge-based systems for bird species identification, employing reliability scores and
ensemble predictions to refine misclassifications through human feedback. This approach highlights the
necessity for research into the application of deep AL and multi-label classification in avian bioacoustics.
However, comparing these results is challenging because they utilize test datasets that are not publicly
available and employ custom AL strategies [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Active Learning in Bird Sound Classification</title>
      <p>
        Motivation. In PAM, a feature vector x ∈ 𝒳 represents a D-dimensional instance, originating from
either a focal recording where 𝒳 = ℱ, or a soundscape recording with 𝒳 = 𝒮. Focal recordings are
extensively available on the citizen-science platform Xeno-Canto (XC) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with a global collection of
over 800,000 recordings, making them particularly suitable for model training. Large-scale bird sound
classification models (e.g., BirdNET [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) are primarily trained on focals. These multi-class recordings
feature isolated bird vocalizations where each instance x is associated with a class label y ∈ 𝒴, where
𝒴 = {1, ..., C}. The focal data distribution is denoted as Focal(x, y). However, annotations from XC
often come with weak labels, lacking precise vocalization timestamps. As noted by Van Merriënboer
et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], evaluating on focals does not adequately reflect a model’s generalization performance in
real-world PAM scenarios, rendering them unsuitable for assessing deployment capabilities. Soundscape
recordings are passively recorded in specific regions, capturing the entire acoustic environment for
PAM projects using static ARUs over extended periods. For instance, the High Sierra Nevada (HSN) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
dataset includes long-duration soundscapes with precise labels and timestamps from multiple recording
sites. Soundscapes are treated as multi-label tasks and are valuable for assessing model deployment
in real-world PAM. Each instance x is associated with multiple class labels y ∈ 𝒴, represented by
a multi-hot encoded label vector y = [y_1, . . . , y_C] ∈ {0, 1}^C. An instance can contain no bird
sounds, represented by a zero vector y = 0 ∈ ℝ^C. Soundscapes’ limited scale and the extensive
annotation effort make them less suitable for large-scale model training. We denote the soundscape
data distribution as Scape(x, y). The disparity in data distributions, Scape(x, y) ≠ Focal(x, y), leads
to a distribution shift that impacts the performance of SOTA bioacoustic models trained on focals
when deployed in PAM. Additionally, highly diverse deployment conditions in PAM projects, such as
background noise, recording devices, and their locations, also lead to domain differences within and
between soundscape recordings. These variations further highlight the need for compact models that
can quickly and easily adapt to changing environments. Thus, we argue that using labeled soundscapes
in novel deployment scenarios for fine-tuning the model is vital. Therefore, we propose deep AL to
enable fast model adaptation to various PAM scenarios.
      </p>
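      <p>
        The multi-hot target representation described above can be sketched in a few lines. This is an illustrative example with a hypothetical three-species class list, not code or data from the paper:
      </p>
      <preformat>
```python
import numpy as np

# hypothetical class list with C = 3 species
classes = ["species_a", "species_b", "species_c"]

def encode(labels):
    """Multi-hot target vector y in {0, 1}^C for one soundscape segment."""
    y = np.zeros(len(classes))
    for label in labels:
        y[classes.index(label)] = 1.0
    return y

y_two_birds = encode(["species_a", "species_c"])  # -> [1., 0., 1.]
y_no_bird = encode([])  # zero vector: segment contains no bird sounds
```
      </preformat>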
      <p>
        Our approach. Our approach is detailed in Figure 1. We leverage the BirdSet dataset collection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to ensure comparability. We consider a multi-label classification problem, where we equip a model with
a pre-trained feature extractor h_θ : 𝒳 → ℝ^D with parameters θ that maps the inputs x to feature
embeddings h_θ(x). Additionally, we utilize a classification head f_θ^(i) : ℝ^D → ℝ^C with parameters
θ^(i) at cycle iteration i that maps the feature embeddings h_θ(x) to class probabilities via the sigmoid
function. The resulting class probabilities are denoted by p̂ = σ(f_θ^(i)(h_θ(x))), where p̂ ∈ ℝ^C represents
the probabilities for each class in a binary classification problem. We introduce a pool-based AL setting
with an unlabeled pool 𝒰^(i) ⊆ 𝒳 and a labeled pool dataset ℒ^(i) ⊆ 𝒳 × 𝒴. The pool consists of
soundscapes from PAM projects, allowing the model to adapt to the unique acoustic features of new
sites and improve performance across various scenarios. During each cycle iteration i, the query
strategy compiles the most informative instances into a batch ℬ^(i) ⊂ 𝒰^(i) of size b. We represent
an annotated batch as ℬ*^(i) ∈ 𝒳 × 𝒴. We update the unlabeled pool 𝒰^(i+1) = 𝒰^(i) ∖ ℬ^(i) and the
labeled pool ℒ^(i+1) = ℒ^(i) ∪ ℬ*^(i) by adding the annotated batch. At each iteration i, the model f_θ^(i) is
retrained using the binary cross-entropy loss L_BCE(x, y), resulting in the updated model parameters
θ^(i+1). The process continues until a budget B is exhausted.
      </p>
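      <p>
        The pool-based cycle can be sketched as follows. This is a toy numpy illustration, not the actual BirdSet pipeline: the data sizes, the gradient-descent linear head on frozen embeddings, and the mean-entropy query are stand-ins; only the epoch count and learning rate mirror the setup in Section 4.
      </p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_head(X, Y, epochs=200, lr=0.05):
    """Fit a linear multi-label head on frozen embeddings with BCE loss."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(epochs):
        P = sigmoid(X @ W)
        W -= lr * X.T @ (P - Y) / len(X)  # BCE gradient for a sigmoid head
    return W

def mean_entropy(P, eps=1e-9):
    """Mean binary entropy over all class probabilities per instance."""
    H = -(P * np.log(P + eps) + (1 - P) * np.log(1 - P + eps))
    return H.mean(axis=1)

# toy pools: 2-D embeddings h(x), C = 3 classes, multi-label oracle targets
X_pool = rng.normal(size=(100, 2))
Y_pool = (rng.random((100, 3)) > 0.7).astype(float)

labeled = list(range(10))  # initial randomly selected instances
unlabeled = [i for i in range(100) if i not in labeled]
b = 10  # acquisition batch size

for cycle in range(3):  # budget B = 10 + 3 * b in this toy run
    W = train_head(X_pool[labeled], Y_pool[labeled])
    scores = mean_entropy(sigmoid(X_pool[unlabeled] @ W))
    # query the b most uncertain instances from the unlabeled pool
    batch = [unlabeled[i] for i in np.argsort(-scores)[:b]]
    labeled += batch  # the oracle annotates the batch
    unlabeled = [i for i in unlabeled if i not in batch]
```
      </preformat>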
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        Setup. We employ Google’s Perch as the pre-trained feature extractor with a feature dimensionality of
 = 1280, following Ghani et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Each iteration of the AL cycle involves initializing and training
the last DNN layer for 200 epochs using the Rectified Adam optimizer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (batch size: 128, learning
rate: 0.05, weight decay: 0.0001) with a cosine annealing scheduler [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The hyperparameters are
empirically determined with convergence on random train samples as done in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We utilize the HSN
dataset [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] from BirdSet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], consisting of 5,280 five-second soundscape segments from the initial
three days of recordings for our unlabeled pool. Thus, we simulate a practical deployment scenario
where we initially collect data from various recording sites to which we want to quickly adapt the model
while reducing annotation effort. Subsequently, we utilize 6,720 segments from the last two days for
testing model performance. Initially, 10 instances are selected randomly, followed by 50 iterations
of b = 10 acquisitions each, totaling a budget of B = 510. We benchmark against Random acquisitions
and use Typiclust [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Badge [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as diversity-based and hybrid strategies, respectively. As an
uncertainty-based strategy, we employ the mean Entropy of all binary predictions. The effectiveness
of each strategy is assessed by analyzing the learning curves through a collection of threshold-free
metrics [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: T1-accuracy, class-based mean average precision (cmAP), and area under the receiver
operating characteristic curve (AUROC). The metrics are computed on the test dataset post-training in
each cycle, with learning curve improvements averaged over ten repetitions for consistency.
      </p>
      <p>
        Results. We present the improvement curves for the metric collection in Figure 2. The results
demonstrate that no single strategy is universally superior across all metrics. However, nearly all
metrics show enhanced performance compared to Random. Notably, Typiclust displays strong
performance across all metrics at the start of the deep AL cycle, supporting the findings of [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] that
a diverse selection is beneficial at the cycle’s onset. However, its effectiveness diminishes over time
when diversity becomes less crucial. Conversely, except for the AUROC metric where Entropy initially
performs poorly but strongly improves over time, Entropy outperforms in all iterations for cmAP and
T1-Acc, showing a consistent improvement over Random of up to 15%.
      </p>
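      <p>
        The threshold-free metric collection can be sketched with plain numpy. These are illustrative re-implementations, not the benchmark code; in particular, the T1-accuracy definition here (top-scoring class is among the true labels, skipping instances without any positive label) is an assumption based on the BirdSet description:
      </p>
      <preformat>
```python
import numpy as np

def average_precision(y, s):
    """AP for one class: precision averaged over the ranked positives."""
    order = np.argsort(-s)
    y_sorted = y[order]
    hits = np.cumsum(y_sorted)
    precision = hits / np.arange(1, len(y) + 1)
    return precision[y_sorted == 1].mean()

def binary_auroc(y, s):
    """Probability that a random positive outranks a random negative."""
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

def evaluate(y_true, y_prob):
    """cmAP and AUROC are macro-averaged over classes; T1-Acc checks
    whether the top-scoring class is a true label (assumed definition)."""
    cmap = np.mean([average_precision(y_true[:, c], y_prob[:, c])
                    for c in range(y_true.shape[1])])
    auroc = np.mean([binary_auroc(y_true[:, c], y_prob[:, c])
                     for c in range(y_true.shape[1])])
    has_label = y_true.sum(axis=1) > 0
    top1 = y_prob.argmax(axis=1)
    t1_acc = y_true[has_label, top1[has_label]].mean()
    return {"cmAP": cmap, "AUROC": auroc, "T1-Acc": t1_acc}
```
      </preformat>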
    </sec>
    <sec id="sec-5">
      <title>5. Open Challenges and Limitations</title>
      <p>This pilot study explores the use of deep AL to tailor avian bioacoustic models for various deployment
scenarios in PAM. Although the initial results are encouraging, they remain preliminary. Several key
challenges, which are outlined below, need to be addressed to fully realize the potential of deep AL in
this field.</p>
      <p>
        Pool creation. The limited availability of soundscape data, which is primarily used for model evaluation
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], poses challenges in creating pool datasets for deep AL. The process of generating a fine-tuning
training pool can affect class balance and raises concerns about the composition methodology.
Additionally, in scenarios where data are sourced from PAM projects, the variability in recording sites is
often not disclosed in publicly available datasets. This lack of information makes it challenging to create
a diverse and representative training pool that takes recording locations into account. To effectively
investigate deep AL, a transparent approach to dataset generation is essential.
      </p>
      <p>
        Deployment in practice. Deploying deep AL in real-world PAM environments requires addressing
several practical considerations. These include determining optimal batch sizes for data annotation
and effectively allocating the total budget. The labor-intensive and costly process of labeling PAM
recordings, which requires human expertise [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], highlights the need for accurately estimating the
expected annotation effort. Additionally, exploring various deployment settings and tasks can reveal
the versatility and potential challenges of applying deep AL, leading to more effective and scalable
solutions for avian bioacoustics. For instance, tasks might involve not only classifying bird species but
also identifying specific call densities [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which would require modifications to the model evaluation
process.
      </p>
      <p>
        Evaluation. Traditional metrics such as AUROC, cmAP, and T1-Acc offer a general overview of model
performance but may be inadequate in practice-specific scenarios, such as ensuring a high recall of a
specific species or identifying bird call density [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. A more nuanced approach to evaluating deep AL
models involves customizing metrics to align with practical objectives, such as consistently identifying
specific species. Enhancing evaluation methodologies to capture these specialized requirements is
crucial for advancing the effectiveness of deep AL in real-world PAM applications.
      </p>
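      <p>
        Such a practice-specific metric, for example the recall of one target species at a fixed decision threshold, could be monitored alongside the threshold-free collection. The following is an illustrative sketch; the threshold value and class index are hypothetical:
      </p>
      <preformat>
```python
import numpy as np

def species_recall(y_true, y_prob, class_idx, thr=0.5):
    """Recall for one target species: the fraction of its true occurrences
    the model detects at a fixed decision threshold."""
    y = y_true[:, class_idx]
    detected = (y_prob[:, class_idx] >= thr).astype(float)
    true_positives = (detected * y).sum()
    return true_positives / max(y.sum(), 1.0)
```
      </preformat>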
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        In this work, we demonstrated the potential of deep active learning (AL) in computational avian
bioacoustics. We showed how deep AL can be integrated into real-world passive acoustic monitoring
by utilizing BirdSet, where a rapid model adaptation through fine-tuning on soundscape recordings
is advantageous for the identification of bird species. Our results indicate that employing selection
strategies in deep AL enhances model performance and accelerates adaptation compared to random
sampling. For future work, we aim to expand the implementation of deep AL in avian bioacoustics
utilizing all datasets from the BirdSet dataset collection to provide more robust performance insights
and explore additional query strategies [
        <xref ref-type="bibr" rid="ref13 ref20">13, 20</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Van</given-names>
            <surname>Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          ,
          <article-title>BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2312.07439.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eibl</surname>
          </string-name>
          , H. Klinck,
          <article-title>BirdNET: A deep learning solution for avian diversity monitoring</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>61</volume>
          (
          <year>2021</year>
          )
          101236. URL: https://doi.org/10.1016/j.ecoinf.2021.101236.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Feature Embeddings from Large-Scale Acoustic Bird Classifiers Enable Few-Shot Transfer Learning</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2023</year>
          ). doi: 10.48550/arXiv.2307.06292.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Decke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gruhl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          , DADO --
          <article-title>Low-cost query strategies for deep active design optimization</article-title>
          ,
          <source>in: 2023 International Conference on Machine Learning and Applications (ICMLA)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1611</fpage>
          -
          <lpage>1618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tomforde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scholz</surname>
          </string-name>
          , Active Bird2Vec:
          <article-title>Towards End-to-End Bird Sound Monitoring with Transformers</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2308.07121.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huseljic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tomforde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scholz</surname>
          </string-name>
          ,
          <article-title>Birdset: A dataset and benchmark for classification in avian bioacoustics</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2024</year>
          ). doi: 10.48550/arXiv.2403.10380.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wisdom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>Improving Bird Classification with Unsupervised Sound Separation</article-title>
          , in: ICASSP 2022
          <article-title>-</article-title>
          2022 IEEE International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>636</fpage>
          -
          <lpage>640</lpage>
          . URL: https://doi.org/10.1109/ICASSP43922.2022.9747202.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Höchst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bellafkir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lampe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vogelbacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mühling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lindner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rösner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Schabo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Freisleben</surname>
          </string-name>
          , Bird@Edge:
          <article-title>Bird Species Recognition at the Edge</article-title>
          ,
          <source>in: Networked Systems</source>
          , volume
          <volume>13464</volume>
          ,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>86</lpage>
          . URL: https://doi.org/10.1007/978-3-031-17436-0_6.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bellafkir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vogelbacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mühling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Korfhage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Freisleben</surname>
          </string-name>
          ,
          <article-title>Edge-Based Bird Species Recognition via Active Learning</article-title>
          ,
          <source>in: Networked Systems</source>
          , volume
          <volume>14067</volume>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>34</lpage>
          . doi: 10.1007/978-3-031-37765-5_2.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <article-title>The xeno-canto collection and its relation to sound recognition and classification, CEUR-WS</article-title>
          .org,
          <year>2015</year>
          . URL: https://xeno-canto.org/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          , E. Triantafillou, T. Denton,
          <article-title>Birds, Bats and beyond: Evaluating generalization in bioacoustic models</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , J. Han,
          <article-title>On the variance of the adaptive learning rate and beyond</article-title>
          , in: International Conference on Learning Representations,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Huseljic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          ,
          <article-title>Fast fishing: Approximating bait for efficient and scalable deep active image classification</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2024</year>
          ). doi: 10.48550/arXiv.2404.08981.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Huseljic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          ,
          <article-title>Role of hyperparameters in deep active learning</article-title>
          ,
          <source>in: Workshop on Interactive Adaptive Learning @ ECML PKDD</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chaon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Peery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>A collection of fully-annotated soundscape recordings from the Western United States</article-title>
          ,
          <year>2022</year>
          . URL: https://doi.org/10.5281/zenodo.7050014.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hacohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dekel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weinshall</surname>
          </string-name>
          ,
          <article-title>Active learning on a budget: Opposite strategies suit high and low budgets</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Ash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnamurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>Deep batch active learning by diverse, uncertain gradient lower bounds</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stowell</surname>
          </string-name>
          ,
          <article-title>Computational bioacoustics with deep learning: A review and roadmap</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.48550/arXiv.2112.06725.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Navine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Weldy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Hart</surname>
          </string-name>
          ,
          <article-title>All thresholds barred: Direct estimation of call density in bioacoustic data</article-title>
          ,
          <source>Frontiers in Bird Science</source>
          <volume>3</volume>
          (
          <year>2024</year>
          ). doi:10.3389/fbirs.2024.1380636.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aßenmacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huseljic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bischl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          ,
          <article-title>ActiveGLAE: A benchmark for deep active learning with transformers</article-title>
          ,
          <source>in: Machine Learning and Knowledge Discovery in Databases: Research Track</source>
          , Springer Nature Switzerland,
          <year>2023</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>74</lpage>
          . URL: https://doi.org/10.1007/978-3-031-43412-9_4.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>