Few-shot Long-Tailed Bird Audio Recognition

Marcos V. Conde1,2,†, Ui-Jin Choi3,*,†
1 H2O.ai
2 Computer Vision Lab, Institute of Computer Science, University of Würzburg, Germany
3 MegaStudyEdu, South Korea

Abstract
It is easier to hear birds than to see them. However, they still play an essential role in nature and are excellent indicators of deteriorating environmental quality and pollution. Recent advances in deep neural networks allow us to process audio data to detect and classify birds. This technology can assist researchers in monitoring bird populations and biodiversity. We propose a sound detection and classification pipeline to analyze complex soundscape recordings and identify birdcalls in the background. Our method learns from weak labels and limited data and acoustically recognizes bird species. Our solution achieved 18th place out of 807 teams at the BirdCLEF 2022 Challenge hosted on Kaggle. Code and models will be open-sourced at https://github.com/Choiuijin1125/bclef2022.

Keywords
BirdCLEF2022, LifeCLEF2022, Deep Learning, Sound Event Detection, Audio Recognition, CNN

1. Introduction
The BirdCLEF 2022 Challenge [1, 2] proposes to identify which birds are calling in long recordings given quite limited training data. This is the exact challenge faced by scientists trying to monitor rare birds in Hawaii. To address it, we propose a novel machine learning solution to help advance the science of bioacoustics and support ongoing research to protect endangered Hawaiian birds.
The motivation behind this challenge and our solution is the fact that Hawaii has lost 68% of its bird species [3]. Researchers use population bioacoustic monitoring to understand how native birds react to changes in the environment and to conservation efforts. This approach could provide a passive, low-labor, and cost-effective strategy for studying endangered bird populations. Current methods for processing large bioacoustic datasets involve manual annotation of each recording, an expensive process that requires specialized training and large amounts of time. For this reason, we propose a machine learning solution that automatically identifies bird species in long audio recordings via birdcall detection and classification within the audio.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
* Corresponding author.
† Authors contributed equally.
Email: marcos.conde-osorio@uni-wuerzburg.de (M. V. Conde); choiuijin1125@megastudy.net (U. Choi)
Homepage: https://mv-lab.github.io/ (M. V. Conde)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Photographs of some Hawai'i endemic bird species studied in this work: 'Akiapōlā'au (Hemignathus wilsoni), 'Ākohekohe (Palmeria dolei), 'I'iwi (Drepanis coccinea), and Nēnē (Branta sandvicensis). Photo credit: Amanda K. Navine, Alexander Wang and Ann Tanimoto-Johnson.

1.1. Related Work
Recent advances in Machine Learning (ML) have made it possible to automatically identify bird songs of common species using annotated soundscapes as training data. The main challenges from the machine learning point of view are:

1. Weak labels. Training data consists of soundscapes of variable duration, recorded in the wild. Therefore, we find substantial noise in the recordings (other birds besides the target, rain, wind, planes, etc.).
2. Long-tailed distribution. Rare and endangered species (such as those in Hawaii) are less represented in the training data; therefore, the model struggles to learn their features and generalize for those classes.
3. Few-shot training is required. We find fewer than four recordings for some endemic bird species (crehon, hawhaw, maupar, etc.). The most represented bird is "skylar" with 500 recordings, which is still not a large amount of training data in the context of ML.
Previous BirdCLEF challenges [4, 5] proposed different problems related to large-scale bird recognition in soundscapes or complex acoustic environments. Sprengel et al. [6] and Lasseck [7, 8] introduced deep learning techniques for the "bird species identification in soundscapes" problem. State-of-the-art (SOTA) solutions are based on deep Convolutional Neural Networks (CNNs) [9, 10, 11]; deep CNNs with attention mechanisms [12, 13, 14, 15], or architectures suitable for fine-grained classification tasks [16], are usually selected as backbones in these experiments. Pretrained audio neural networks (PANNs) [14] provide a multi-task SOTA baseline for audio-related tasks, showing great generalization capability. Other approaches focus on Sound Event Detection (SED) [17, 18, 14, 19]. Similar to video understanding [20], these approaches usually employ 2D CNNs to extract useful features from the input audio signal (a log-mel spectrogram). Since the extracted features still contain frequency and time information, recurrent neural networks (RNNs) are then used to model a longer temporal context, or the feature map is used directly for prediction because it preserves time-segment information.
Solutions for the BirdCLEF 2021 Challenge follow these directions; moreover, they propose additional post-processing techniques to eliminate false detections (FPs) [21], diverse CNN-based ensembles [22, 23, 24, 25, 26, 27], and transformer-based solutions such as STFT Transformers [28]. These solutions can identify birds in long audio recordings, at different locations (Colombia, USA, and Costa Rica), with ≈ 70% accuracy. In this challenge, we focus only on Hawaiian bird species.
We define some terms related to this challenge, following last year's competition solutions [23, 22], that we will use in the description of our method in Section 2:
• Leaderboard is denoted as LB (including its two variants, public and private).
• Cross-Validation is denoted as CV.
• We define "nocall" as the class corresponding to the events (a.k.a. segments or clips) in the audio where no birdcall is detected.
• We refer to the "BirdCLEF 2021 Birdcall Identification Challenge (Kaggle)" as the "previous or last competition".
• We define "weakly labeled" as labels that do not contain time-wise information about bird species in audio clips (i.e., no information about the specific 5-second segment in which the bird calls).
• We define "strongly labeled" as labels that contain time-wise information about bird species in audio clips (i.e., the approximate second within the audio where the bird calls).
• The BirdCLEF 2021 train soundscape audios are denoted as "train soundscapes"; they consist of 20 strongly labeled audio clips.

1.2. Dataset
The training set consists of short audio recordings of 152 bird species, of which only 21 bird species of interest are scored. These bird species inhabit Hawaii. However, many of the remaining birds across the islands are isolated in difficult-to-access, high-elevation habitats. Therefore, physical monitoring is difficult, and scientists have turned to sound recordings. As we show in Figure 2, the distribution of the scored bird species is very long-tailed [29], making it necessary to deal with extreme class imbalance. As we introduced, our challenge in this competition is to develop ML models that identify bird species using sounds. Such models have to deal with real-world problems such as long-tailed rare birds and weak, noisy labels.

Figure 2: Distribution of bird species in the training set. We can see a notable long-tailed distribution. Many bird species are represented by fewer than 10 audio clips (i.e., maupar, crehon, hawhaw, puaioh). The red line indicates that most of the birds appear fewer than 50 times in the training set. Only two birds (houfin and skylar) appear more than 100 times in the training data.

1.3. Evaluation
Performance is measured using a custom metric that is most similar to the "macro F1 score". The test set consists of approximately 5500 recordings. Participants submit code and models and never have access to the test audios. There is a public LB that shows the score corresponding to 16% of the test set (880 audios), and a private or final LB with the scores over the remaining 84%.

2. Methods
2.1. Preprocessing
Previous BirdCLEF challenges [4, 5] showed that long audio clips for training improve performance. For this reason, we randomly cropped a 30-second time window from each audio; next, we split each 30-second clip into six 5-second chunks as proposed by Henkel et al. [22]; finally, we transformed such chunks into Mel spectrograms using the torchaudio library. The spectrograms were generated using the following parameters: sample_rate=32000, n_mels=128, fmin=50, fmax=14000, hop_size=512, top_db=None.
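A minimal sketch of this preprocessing step is shown below, assuming torchaudio is used as described. The file name, the helper crop_and_chunk, and the FFT size n_fft (which the paper does not report) are illustrative assumptions, not our exact implementation.

```python
import torch
import torchaudio

# Spectrogram parameters reported in Section 2.1
SAMPLE_RATE = 32000
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=2048,          # assumption: the FFT size is not reported in the paper
    hop_length=512,
    f_min=50,
    f_max=14000,
    n_mels=128,
)
to_db = torchaudio.transforms.AmplitudeToDB(top_db=None)

def crop_and_chunk(waveform: torch.Tensor, crop_seconds: int = 30, chunk_seconds: int = 5):
    """Randomly crop a 30 s window and split it into six 5 s chunks.

    waveform: (channels, samples) tensor at 32 kHz; shorter clips are zero-padded.
    Returns a (6, channels, chunk_samples) tensor.
    """
    crop_len = crop_seconds * SAMPLE_RATE
    if waveform.shape[-1] < crop_len:                      # pad short recordings
        waveform = torch.nn.functional.pad(waveform, (0, crop_len - waveform.shape[-1]))
    start = torch.randint(0, waveform.shape[-1] - crop_len + 1, (1,)).item()
    crop = waveform[..., start:start + crop_len]
    chunk_len = chunk_seconds * SAMPLE_RATE
    return torch.stack(torch.split(crop, chunk_len, dim=-1))

# usage on a single recording ("XC000001.ogg" is a hypothetical file name)
waveform, sr = torchaudio.load("XC000001.ogg")
waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
chunks = crop_and_chunk(waveform)                          # (6, C, 160000)
log_mels = to_db(mel_transform(chunks))                    # (6, C, 128, time_frames)
```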
2.2. Augmentations
After splitting the audios, we applied three types of augmentations to improve robustness and to handle the long-tailed distribution problem. First, we used three external datasets for background noise: freefield1010 [30], BirdVox-DCASE-20k [31], and the train soundscapes from the 2021 Challenge. Second, to handle class imbalance, we used selective mixup [32], which only uses the 21 scored birds of interest: in every training batch, we fed randomly cropped Mel spectrograms of scored birds and mixed them with the training data (a code sketch of this step is given after Figure 3). Next, we applied spec-augmentations [33]. These methods showed good performance in our local validation (CV); in particular, selective mixup boosted our score by +0.03. However, we also observed overfitting behavior in some classes: because the 21 scored birds (see Figure 2) also follow a long-tailed distribution, the model tends to predict higher confidence scores for well-represented birds such as "skylar" and "houfin". Figure 3 shows the pre-processing pipeline and augmentations.

2.3. Modeling
We used 9 different backbones and 22 models. Our models combine ideas from the top solutions of the previous competition [4]. The main architectures are CNN backbones with Sound Event Detection heads [19], which showed good performance in the previous challenges [4]. We feed the six 5-second Mel spectrogram parts into the network, as in [22]. Figure 4 shows the pipeline of our models. We also tried ConformerSED [34], FDY-SED [35], and HTS-AT [36], but the results were much worse than with the well-known CNN approaches. We show the different backbones in Table 1. In particular, we focus on tf_efficientnet_b0_ns since it is lightweight and suitable for smartphone devices, and its performance is consistently competitive.

Figure 3: Illustration of our pre-processing pipeline. In every training batch, we use mixup of "scored" birds; we expect this selective mixup to make training more robust for rare birds [22].
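The following is a minimal sketch of the selective mixup described in Section 2.2, assuming a batch of log-mel spectrograms and multi-hot label vectors. The sampler scored_bird_batch(), the beta parameter, and the label-combination rule are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
import torch

def selective_mixup(spec, target, scored_spec, scored_target, alpha=0.5):
    """Mix the current batch with randomly cropped spectrograms of the 21 scored birds.

    spec:   (B, C, n_mels, T) log-mel spectrograms of the training batch
    target: (B, num_classes) multi-hot labels
    scored_spec / scored_target: a same-shaped batch sampled only from
        recordings of the scored (Hawaiian) species.
    """
    lam = np.random.beta(alpha, alpha)                      # mixup coefficient
    mixed_spec = lam * spec + (1.0 - lam) * scored_spec
    # One common choice for weak multi-label targets: keep both species active.
    # The original mixup [32] would instead interpolate:
    #   lam * target + (1 - lam) * scored_target
    mixed_target = torch.clamp(target + scored_target, max=1.0)
    return mixed_spec, mixed_target

# usage inside the training loop (scored_bird_batch() is a hypothetical sampler
# that yields random 5 s crops drawn only from the 21 scored species):
# spec, target = batch
# scored_spec, scored_target = scored_bird_batch(batch_size=spec.size(0))
# spec, target = selective_mixup(spec, target, scored_spec, scored_target)
```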
Figure 4: Illustration of our Bird Classification model, inspired by [22].

2.4. Training
We trained our models using a focal binary cross-entropy loss, the AdamW optimizer, and a cosine annealing scheduler with batch size 24. We also used the quality rating, which is meta-information on audio quality: while computing the loss, we weight it using the normalized quality rating. In addition, we used one-sided label smoothing, adding 0.01 across all negative labels. Both methods were proposed by Henkel et al. [22] and improved our performance consistently.

2.5. Post-Processing
2.5.1. Penalization
We observed that our models are biased and tend to predict the most represented birds (i.e., skylar and houfin) with high confidence scores. This leads to a large number of False Positives (FP) and misclassified clips. We therefore apply a penalty that depends on the distribution of the birds, such that the most represented birds are penalized more. Penalization (PN) is defined in Eq. 1, where x_i denotes the distribution (training-set frequency) of bird i and n is the number of birds. We used penalty_factor = 0.8. Penalization is not a realistic technique, since the model should not filter out bird species in that way; yet, in this scenario, it boosted our score on the public LB. As a better alternative, aiming to make our method less sensitive to the data distribution and robust against background birdcalls from non-interest birds, we searched for class-wise thresholds for each bird species.

p_i = p_i − penalty_factor × (x_i / Σ_{j=1}^{n} x_j)    (1)

2.5.2. Class-Wise Thresholds
We observed that if there is a birdcall in an audio clip, our models output higher confidence scores regardless of the bird species. We used the train_soundscape audio clips to validate nocall thresholds per bird species using the AUC score for each bird (treating call/nocall as a binary classification problem). Even though the train soundscapes contain no labels for the scored birds, we can still estimate appropriate nocall/birdcall thresholds for each bird. We used a grid search to find the best quantile-based nocall threshold per bird, such that we achieve the maximum AUC score for each bird; in other words, such that we distinguish birdcalls from noise or background as well as possible, independently of the bird species present in the audio. Figure 5 shows the distribution of probabilities on the train soundscapes. This class-wise (CW) post-processing method further boosted our score in comparison to Penalization and is more robust.

Figure 5: Distributions of nocall probability validated using train_soundscapes. We show the class-wise best quantile thresholds that obtain the maximum AUC score per bird.
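Below is a minimal sketch of the two post-processing steps under stated assumptions: Eq. 1 applied to a matrix of clip-level probabilities, and a quantile grid search that scores each candidate threshold with the call/nocall AUC on the train soundscapes. The candidate quantile grid and the nocall_labels array (derived from the strongly labeled train soundscapes) are illustrative choices, not taken from our exact implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def penalize(probs, class_counts, penalty_factor=0.8):
    """Eq. 1: subtract a penalty proportional to how frequent each bird is in training.

    probs:        (num_clips, num_classes) predicted probabilities
    class_counts: (num_classes,) number of training recordings per bird
    """
    penalty = penalty_factor * class_counts / class_counts.sum()
    return probs - penalty[None, :]

def classwise_quantile_thresholds(probs, nocall_labels,
                                  quantiles=np.arange(0.70, 1.00, 0.01)):
    """Grid-search a quantile-based nocall threshold for each bird.

    probs:         (num_clips, num_classes) probabilities on train_soundscapes
    nocall_labels: (num_clips,) 1 if the clip contains any birdcall, 0 if nocall
                   (assumption: taken from the strong train_soundscape labels)
    Returns one threshold per class, chosen to maximize the call/nocall AUC.
    """
    thresholds = np.zeros(probs.shape[1])
    for c in range(probs.shape[1]):
        best_auc, best_thr = -1.0, 0.5
        for q in quantiles:
            thr = np.quantile(probs[:, c], q)
            # binarize the class probability and score it against call/nocall
            auc = roc_auc_score(nocall_labels, (probs[:, c] >= thr).astype(float))
            if auc > best_auc:
                best_auc, best_thr = auc, thr
        thresholds[c] = best_thr
    return thresholds
```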
3. Results and discussion
Table 1 summarizes our experiments. We tested 9 different CNN backbones. We found it difficult to calibrate thresholds for an ensemble of models; nevertheless, we used quantile-based thresholds on the ensemble predictions. Penalization (PN) showed good performance on the public LB. However, penalizing common birds such as skylar or houfin, which most probably appear in most of the audios, is not realistic. On the other hand, the class-wise (CW) method showed better performance in general, and it is robust to background birds. We find that the calibration of thresholds is very sensitive because rare birds are scarce both in the audio clips and in the real world. Our results imply that we can find proper thresholds for each rare bird using nocall/birdcall validation and a quantile-based approach without strongly labeled data, as we show in Figure 5. We also provide qualitative Grad-CAM [37] results of our model tf_efficientnet_b0_ns in Figure 6, which show how our model is able to learn and focus on particular frequencies and segments within the audio, and that it is robust against background noise.

Table 1
Experimental results of our models. For local validation (CV), we used the micro F1-score and the train soundscapes. We highlight in blue our top-3 models, in yellow the results of our final submission ensemble, and in green the top solutions on the challenge LB. We also distinguish two post-processing methods: PN and CW. Contemporary results from other competitors can be found in [38, 39, 40, 41].

Backbone                          CV       Public LB   Private LB   Post-Proc.
tf_efficientnet_b0_ns [13]        0.8745   0.7922      0.7240       PN
tf_efficientnet_b0_ns [13]        0.8745   0.7817      0.7548       CW
eca_nfnet_l0 [42]                 0.8761   0.7510      0.7387       CW
resnest50d [12]                   0.8822   0.7550      0.7372       CW
tf_efficientnet_b1_ns [13]        0.7843   0.7395      0.6946       CW
tf_efficientnet_b2_ns [13]        0.8848   0.7277      0.7046       CW
tf_efficientnet_b3_ns [13]        0.8561   0.7262      0.6640       CW
tf_efficientnetv2_s_in21k [43]    0.8632   0.7620      0.7439       CW
tf_efficientnetv2_b0 [43]         0.8762   0.7268      0.7070       CW
Ours Ensemble                     –        0.7971      0.7733       CW
Ours Ensemble                     –        0.8359      0.7630       PN
1st Place                         –        0.8953      0.8527       –
2nd Place                         –        0.9128      0.8438       –
3rd Place                         –        0.8750      0.8126       –
BirdNET [44]                      –        0.85        0.78         –

4. Conclusion
We hope our work can help researchers and conservation practitioners accurately survey population trends, so that they can regularly and more effectively evaluate threats. We present a sound detection and classification pipeline for analyzing soundscape recordings. Our models learn from few data and weak labels; they can accurately classify fine-grained bird vocalizations in 0.04 s using a single GPU. Moreover, they show robustness against noisy sounds (e.g., rain, cars). We aim to improve the models' efficiency for smartphone applications.

Figure 6: Grad-CAM [37] activations from our model tf_efficientnet_b0_ns on different validation audio spectrograms. These qualitative results show how our model focuses on particular frequencies through time to recognize the birds. Best viewed in the electronic version.

Acknowledgments
Marcos Conde is supported by H2O.ai and by the Humboldt Foundation (JMU Würzburg). We would like to thank Kaggle and Dr. Stefan Kahl for hosting the BirdCLEF 2022 Challenge. We also want to thank the contributions from: Amanda K. Navine, Ann Tanimoto-Johnson, Hidehisa Arai, Christof Henkel, Pascal Pfeiffer, and Philipp Singer.

References
[1] S. Kahl, A. Navine, T. Denton, H. Klinck, P. Hart, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2022: Endangered bird species recognition in soundscape recordings, Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum (2022).
[2] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel, P. Bonnet, M. Šulc, H. Müller, Overview of LifeCLEF 2022: an evaluation of machine-learning based species identification and species distribution prediction, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2022.
[3] S. Kahl, Kaggle, BirdCLEF 2022, https://www.kaggle.com/competitions/birdclef-2022/, 2022. Accessed: 2022-06-03.
[4] S. Kahl, T. Denton, H. Klinck, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2021: Bird call identification in soundscape recordings, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021, pp. 1437–1450.
[5] S. Kahl, M. Clapp, W. Hopping, H. Goëau, H. Glotin, R. Planqué, W.-P. Vellinga, A. Joly, Overview of BirdCLEF 2020: Bird Sound Recognition in Complex Acoustic Environments, in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, 2020.
[6] E. Sprengel, M. Jaggi, Y. Kilcher, T. Hofmann, Audio based bird species identification using deep learning techniques, in: CLEF, 2016.
[7] M. Lasseck, Bird species identification in soundscapes, in: CLEF, 2019.
[8] M. Lasseck, Audio-based bird species identification with deep convolutional neural networks, in: CLEF, 2018.
[9] J. Schlüter, Bird identification from timestamped, geotagged audio recordings, in: CLEF, 2018.
[10] J. Bai, C. Chen, J. Chen, Xception based method for bird sound recognition of BirdCLEF 2020, in: CLEF, 2020.
[11] M. Mühling, J. Franz, N. Korfhage, B. Freisleben, Bird species recognition via neural architecture search, in: CLEF, 2020.
[12] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha, M. Li, A. Smola, ResNeSt: Split-attention networks, 2020. arXiv:2004.08955.
[13] M. Tan, Q. V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, 2020. arXiv:1905.11946.
[14] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-scale pretrained audio neural networks for audio pattern recognition, 2020. arXiv:1912.10211.
[15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. arXiv:1512.03385.
[16] M. V. Conde, K. Turgutlu, Exploring vision transformers for fine-grained classification, arXiv preprint arXiv:2106.10587 (2021).
[17] K. Drossos, S. I. Mimilakis, S. Gharib, Y. Li, T. Virtanen, Sound event detection with depthwise separable and dilated convolutions, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–7.
[18] E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, X. Serra, Learning sound event classifiers from web audio with noisy labels, 2019. arXiv:1901.01189.
[19] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. Bello, Robust sound event detection in bioacoustic sensor networks, PLOS ONE 14 (2019) e0214168. doi:10.1371/journal.pone.0214168.
[20] L. Zhang, S. Nizampatnam, A. Gangopadhyay, M. V. Conde, Multi-attention networks for temporal localization of video-level labels, arXiv preprint arXiv:1911.06866 (2019).
[21] N. Murakami, H. Tanaka, M. Nishimori, Birdcall identification using CNN and gradient boosting decision trees with weak and noisy supervision, in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st to 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 1597–1608. URL: http://ceur-ws.org/Vol-2936/paper-136.pdf.
[22] C. Henkel, P. Pfeiffer, P. Singer, Recognizing bird species in diverse soundscapes under weak supervision, 2021. URL: https://arxiv.org/abs/2107.07728. doi:10.48550/ARXIV.2107.07728.
[23] M. V. Conde, K. Shubham, P. Agnihotri, N. D. Movva, S. Bessenyei, Weakly-supervised classification and detection of bird sounds in the wild. A BirdCLEF 2021 solution, in: CLEF, CEUR-WS.org, 2021, pp. 1547–1558. URL: http://ceur-ws.org/Vol-2936/paper-131.pdf.
[24] G. Das, S. Aggarwal, Bird-species audio identification, ensembling 1D + 2D signals, in: Proceedings of the Working Notes of CLEF 2021, 2021.
[25] A. S. Kumar, D. Kowerko, TUC Media Computing at BirdCLEF 2021: Noise augmentation strategies in bird sound classification in combination with DenseNets and ResNets, in: Proceedings of the Working Notes of CLEF 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 1617–1626. URL: http://ceur-ws.org/Vol-2936/paper-138.pdf.
[26] M. V. Shugaev, N. Tanahashi, P. Dhingra, U. Patel, BirdCLEF 2021: Building a birdcall segmentation model based on weak labels, in: Proceedings of the Working Notes of CLEF 2021, 2021.
[27] J. Schlüter, Learning to monitor birdcalls from weakly-labeled focused recordings, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 1627–1638. URL: http://ceur-ws.org/Vol-2936/paper-139.pdf.
[28] J.-F. Puget, STFT transformers for bird song recognition, in: CLEF, CEUR-WS.org, 2021, pp. 1609–1616. URL: http://ceur-ws.org/Vol-2936/paper-137.pdf.
[29] T. Weyand, A. Araujo, B. Cao, J. Sim, Google Landmarks Dataset v2 – a large-scale benchmark for instance-level recognition and retrieval, 2020. URL: https://arxiv.org/abs/2004.01804. doi:10.48550/ARXIV.2004.01804.
[30] D. Stowell, M. D. Plumbley, An open dataset for research on audio field recording archives: freefield1010, 2013. URL: https://arxiv.org/abs/1309.5275. doi:10.48550/ARXIV.1309.5275.
[31] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. P. Bello, BirdVox-DCASE-20k: a dataset for bird audio detection in 10-second clips, 2018. URL: https://doi.org/10.5281/zenodo.1208080. doi:10.5281/zenodo.1208080.
[32] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, 2017. URL: https://arxiv.org/abs/1710.09412. doi:10.48550/ARXIV.1710.09412.
[33] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, in: Proc. Interspeech 2019, 2019, pp. 2613–2617. doi:10.21437/Interspeech.2019-2680.
[34] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, K. Takeda, Conformer-based sound event detection with semi-supervised learning and data augmentation, DCASE2020 Workshop (2020).
[35] H. Nam, S.-H. Kim, B.-Y. Ko, Y.-H. Park, Frequency dynamic convolution: Frequency-adaptive pattern recognition for sound event detection, 2022. URL: https://arxiv.org/abs/2203.15296. doi:10.48550/ARXIV.2203.15296.
[36] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, S. Dubnov, HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection, 2022. URL: https://arxiv.org/abs/2202.00874. doi:10.48550/ARXIV.2202.00874.
[37] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[38] S. Krishnan, P. Khandelwal, R. Garg, Bird Species Classification: One Step at a Time, in: CLEF Working Notes 2022, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2022, Bologna, Italy, 2022.
[39] E. Martynov, Y. Uematsu, Dealing with Class Imbalance in Bird Sound Classification, in: CLEF Working Notes 2022, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2022, Bologna, Italy, 2022.
[40] A. Miyaguchi, J. Yu, B. Cheungvivatpant, D. Dudley, A. Swain, Motif Mining and Unsupervised Representation Learning for BirdCLEF 2022, in: CLEF Working Notes 2022, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2022, Bologna, Italy, 2022.
[41] A. Sampathkumar, D. Kowerko, TUC Media Computing at BirdCLEF 2022: Strategies in identifying bird sounds in a complex acoustic environment, in: CLEF Working Notes 2022, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2022, Bologna, Italy, 2022.
[42] A. Brock, S. De, S. L. Smith, K. Simonyan, High-performance large-scale image recognition without normalization, in: International Conference on Machine Learning, PMLR, 2021, pp. 1059–1071.
[43] M. Tan, Q. Le, EfficientNetV2: Smaller models and faster training, in: International Conference on Machine Learning, PMLR, 2021, pp. 10096–10106.
[44] S. Kahl, C. M. Wood, M. Eibl, H. Klinck, BirdNET: A deep learning solution for avian diversity monitoring, Ecological Informatics 61 (2021) 101236.