Few-shot Long-Tailed Bird Audio Recognition
Marcos V. Conde1,2,†, Ui-Jin Choi3,*,†
1 H2O.ai
2 Computer Vision Lab, Institute of Computer Science, University of Würzburg, Germany
3 MegaStudyEdu, South Korea


Abstract
It is easier to hear birds than see them. However, they still play an essential role in nature and are excellent indicators of deteriorating environmental quality and pollution. Recent advances in Deep Neural Networks allow us to process audio data to detect and classify birds. This technology can assist researchers in monitoring bird populations and biodiversity. We propose a sound detection and classification pipeline to analyze complex soundscape recordings and identify birdcalls in the background. Our method learns from weak labels and few data and acoustically recognizes the bird species. Our solution achieved 18th place of 807 teams at the BirdCLEF 2022 Challenge hosted on Kaggle. Code and models will be open-sourced at https://github.com/Choiuijin1125/bclef2022.

Keywords
BirdCLEF2022, LifeCLEF2022, Deep Learning, Sound Event Detection, Audio Recognition, CNN




1. Introduction
The BirdCLEF 2022 Challenge [1, 2] proposes to identify which birds are calling in long recordings, given quite limited training data. This is the exact challenge faced by scientists trying to monitor rare birds in Hawaii. To this end, we propose a novel machine learning solution to help advance the science of bioacoustics and support ongoing research to protect endangered Hawaiian birds.
   The motivation behind this challenge and our solution is the fact that Hawaii has lost 68%
of its bird species [3].
   Researchers use population bioacoustic monitoring to understand how native birds react to changes in the environment and conservation efforts. This approach could provide a passive, low-labor, and cost-effective strategy for studying endangered bird populations. Current methods for processing large bioacoustic datasets involve manual annotation of each recording. This is an expensive process that requires specialized training and large amounts of time. For this reason, we propose a Machine Learning solution that automatically identifies bird species in long audio recordings via birdcall detection and classification within the audio.


CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
* Corresponding author.
† Authors contributed equally.
marcos.conde-osorio@uni-wuerzburg.de (M. V. Conde); choiuijin1125@megastudy.net (U. Choi)
https://mv-lab.github.io/ (M. V. Conde)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 1: Photographs of some Hawai’i endemic bird species studied in this work: ’Akiapōlā’au (Hemignathus wilsoni), ’Ākohekohe (Palmeria dolei), ’I’iwi (Drepanis coccinea), and Nēnē (Branta sandvicensis). Photo credit: Amanda K. Navine, Alexander Wang and Ann Tanimoto-Johnson.


1.1. Related Work
Recent advances in Machine Learning (ML) have made it possible to automatically identify bird
songs for common species using annotated soundscapes as training data.
  The main challenges from the machine learning point of view are:
   1. Weak labels. Training data consists of soundscapes of variable duration, recorded in the
      wild. Therefore, we find substantial noise in the recordings (other birds besides the target,
      rain, wind, planes, etc.).
   2. Long-tailed distribution. Rare and endangered species (such as those in Hawaii) are less
      represented in the training data, and therefore, the model struggles to learn their features
      and generalize for those classes.
   3. Few-shot training is required. We find fewer than four recordings for some endemic bird
      species (crehon, hawhaw, maupar, etc.). The most represented bird is "skylar" with 500
      recordings, which is not a tremendous amount of training data in the context of ML.
   Previous BirdCLEF challenges [4, 5] proposed different problems related to large-scale bird recognition in soundscapes or complex acoustic environments. Sprengel et al. [6] and Lasseck [7, 8] introduced deep learning techniques for the "bird species identification in soundscapes" problem. State-of-the-art (SOTA) solutions are based on Deep Convolutional Neural Networks (CNNs) [9, 10, 11]; deep CNNs with attention mechanisms [12, 13, 14, 15] or architectures suitable for fine-grained classification tasks [16] are usually selected as backbones in these experiments. Pretrained audio neural networks (PANNs) [14] provide a multi-task SOTA baseline for audio-related tasks, showing great generalization capability. Other approaches focus on Sound Event Detection (SED) [17, 18, 14, 19]. Similar to video understanding [20], these approaches usually employ 2D CNNs to extract useful features from the input audio signal (log-mel spectrogram); since these features still contain frequency and time information, recurrent neural networks (RNNs) are then used to model longer temporal context, or the feature map is used directly for prediction because it preserves time-segment information.
   Solutions for the BirdCLEF 2021 Challenge follow these directions; moreover, they propose additional post-processing techniques to eliminate false positives (FPs) [21], diverse CNN-based ensembles [22, 23, 24, 25, 26, 27], and transformer-based solutions such as STFT Transformers [28]. These solutions can identify birds in long audio recordings, at different locations (Colombia, USA, and Costa Rica), with ≈ 70% accuracy. In this challenge, we focus only on Hawaiian bird species.
  We define some terms related to this challenge, referring to last year's competition solutions [23, 22], that we will use in the description of our method in Section 2:

    • Leaderboard, denoted as LB (including its two variants, public and private).
    • Cross-Validation, denoted as CV.
    • We define "nocall" as the class corresponding to events (a.k.a. segments or clips) in the
      audio where birdcalls are not detected.
    • We refer to the "BirdCLEF 2021 Birdcall Identification Challenge (Kaggle)" as the "previous
      or last competition".
    • We define "weakly labeled" as labels that do not contain time-wise information about
      bird species in audio clips (i.e., no specific information about in which 5s segment of the
      audio the bird calls).
    • We define "strongly labeled" as labels that contain time-wise information about bird
      species in audio clips (i.e., the approximate second within the audio where the bird calls).
    • BirdCLEF 2021 train soundscapes audios, denoted as "train soundscapes", which are 20
      strongly labeled audio clips.

1.2. Dataset
The training set consists of short audio recordings of 152 bird species, and only 21 bird species of interest are scored. These bird species inhabit Hawaii. However, many of the remaining birds across the islands are isolated in difficult-to-access, high-elevation habitats. Therefore, physical monitoring is difficult, and scientists have turned to sound recordings. As we show in Figure 2, the distribution of the "interesting" bird species is very long-tailed [29], making it necessary to deal with extreme class imbalance. As we introduced, in this competition, our challenge is to develop ML models to identify bird species using sounds. Such models have to deal with real-world problems such as long-tailed rare birds and weak, noisy labels.




Figure 2: Distribution of bird species in the training set. We can see a notable long-tailed distribution.
Many bird species are represented with less than 10 audio clips (i.e., maupar, crehon, hawhaw, puaioh).
The red line indicates that most of the birds appear less than 50 times in the training set. Only two birds
(houfin and skylar) appear more than 100 times in the training data.
1.3. Evaluation
The performance is measured using a custom metric that is most similar to the "macro F1 score". The test set consists of approximately 5500 recordings. Participants submit code and models and never have access to the test audios. There is a public LB that shows the score corresponding to 16% of the test set (880 audios), and a private or final LB with the scores over the remaining 84%.
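As a rough reference, the toy example below computes a plain macro-averaged F1 score with scikit-learn over hypothetical multi-label segment predictions; the actual competition metric is a custom variant and may aggregate segments and species differently.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions for 4 segments and 3 scored species
# (rows = 5s segments, columns = species, 1 = species present/predicted).
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1]])

# Macro averaging computes F1 per species and then the unweighted mean,
# so rare species count as much as common ones.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```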


2. Methods
2.1. Preprocessing
Previous BirdCLEF challenges [4, 5] showed that long audio clips for training improve performance. For this reason, we randomly cropped a 30-second time window of each audio; next, we split each 30s audio clip into six 5s chunks, as proposed by Henkel et al. [22]; finally, we transformed these chunks into Mel spectrograms using the torchaudio library. The spectrograms were generated using the following parameters: sample_rate=32000, n_mels=128, fmin=50, fmax=14000, hop_size=512, top_db=None.
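For reference, a minimal sketch of this preprocessing with torchaudio is shown below. torchaudio names the parameters hop_length, f_min and f_max; the n_fft window size is an assumption (not reported above), and mono 1-D waveforms are assumed.

```python
import torch
import torchaudio

SAMPLE_RATE = 32000

# Mel spectrogram transform with the parameters listed above
# (n_fft is an assumed value; it is not reported in the text).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=2048, hop_length=512,
    n_mels=128, f_min=50, f_max=14000)
to_db = torchaudio.transforms.AmplitudeToDB(top_db=None)

def audio_to_chunks(waveform: torch.Tensor) -> torch.Tensor:
    """Randomly crop a 30s window, split it into six 5s chunks and
    return their log-mel spectrograms with shape (6, n_mels, time)."""
    clip_len = 30 * SAMPLE_RATE
    if waveform.numel() < clip_len:  # pad recordings shorter than 30s
        waveform = torch.nn.functional.pad(waveform, (0, clip_len - waveform.numel()))
    start = torch.randint(0, waveform.numel() - clip_len + 1, (1,)).item()
    chunks = waveform[start:start + clip_len].reshape(6, 5 * SAMPLE_RATE)
    return to_db(mel(chunks))
```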

2.2. Augmentations
After splitting the audios, we applied three types of augmentations to improve robustness and handle the long-tailed distribution problem. First, we used three external datasets for background noise: freefield1010 [30], BirdVox-DCASE-20k [31], and the train_soundscapes from the 2021 Challenge.
   Second, to handle class imbalance, we used selective mixup [32], which mixes in only audio clips of the 21 scored birds of interest. In every training batch, we fed randomly cropped Mel spectrograms of scored birds and mixed them with the training data. Next, we applied SpecAugment [33]. These methods showed good performance in our local validation (CV); in particular, selective mixup boosted our score by +0.03.
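A minimal sketch of this selective mixup step is given below; the beta-distribution parameter and the way multi-label targets are combined are assumptions, since the exact settings are not reported here.

```python
import torch

def selective_mixup(specs, labels, scored_specs, scored_labels, alpha=0.5):
    """Mix each training spectrogram with a spectrogram randomly drawn from
    the pool of the 21 scored species. specs/scored_specs: (B, ...) tensors,
    labels/scored_labels: (B, n_classes) multi-hot targets.
    alpha and the label-union rule are assumptions, not the authors' exact code."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randint(0, scored_specs.size(0), (specs.size(0),))
    mixed_specs = lam * specs + (1.0 - lam) * scored_specs[idx]
    # With weak multi-label targets we keep both sets of labels active,
    # so the injected scored bird is always marked as present.
    mixed_labels = torch.clamp(labels + scored_labels[idx], max=1.0)
    return mixed_specs, mixed_labels
```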
   However, we also observed overfitting in some classes: because the scored birds (21 classes, as in Figure 2) follow a long-tailed distribution, the model tends to predict higher confidence scores for well-represented birds such as "skylar" and "houfin".
Figure 3 shows the pre-processing pipeline and augmentations.

2.3. Modeling
We used 9 different backbones and 22 models. Our models are a combination of top solutions from the previous competition [4]. The main architectures are CNN backbones with Sound Event Detection heads [19], which showed good performance in previous challenges [4]. We feed 5-second, 6-part Mel spectrograms into the network as in [22]. Figure 4 shows the pipeline of our models. We also tried ConformerSED [34], FDY-SED [35], and HTS-AT [36], but the results were much worse than with the well-known CNN approaches. We show the different backbones in Table 1. In particular, we focus on tf_efficientnet_b0_ns since it is lightweight, suitable for smartphone devices, and its performance is consistently competitive.
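The sketch below illustrates this type of architecture: a timm CNN backbone followed by a PANNs-style attention (SED) head. It is a simplified illustration under our assumptions, not the exact training code; the head layout loosely follows [14, 19, 22].

```python
import timm
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """PANNs-style attention pooling over time (simplified sketch after [14, 19])."""
    def __init__(self, in_channels: int, n_classes: int):
        super().__init__()
        self.att = nn.Conv1d(in_channels, n_classes, kernel_size=1)
        self.cla = nn.Conv1d(in_channels, n_classes, kernel_size=1)

    def forward(self, x):                         # x: (batch, channels, time)
        att = torch.softmax(torch.tanh(self.att(x)), dim=-1)
        framewise = torch.sigmoid(self.cla(x))    # segment-wise predictions
        clipwise = (att * framewise).sum(dim=-1)  # weak, clip-level prediction
        return clipwise, framewise

class BirdSEDModel(nn.Module):
    """CNN backbone + SED head; backbone names follow Table 1."""
    def __init__(self, backbone: str = "tf_efficientnet_b0_ns", n_classes: int = 152):
        super().__init__()
        self.backbone = timm.create_model(
            backbone, pretrained=True, in_chans=1, num_classes=0, global_pool="")
        self.head = AttentionHead(self.backbone.num_features, n_classes)

    def forward(self, spec):                      # spec: (batch, 1, n_mels, time)
        feat = self.backbone(spec)                # (batch, channels, mel', time')
        feat = feat.mean(dim=2)                   # pool the frequency axis
        return self.head(feat)
```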
Figure 3: Illustration of our pre-processing pipeline. In every training batch, we use mixup of "scored" birds. We expect selective mixup to make training more robust for rare birds [22].




Figure 4: Illustration of our Bird Classification model, inspired by [22].


2.4. Training
We trained our models using the focal binary cross-entropy loss, the AdamW optimizer, and a cosine annealing scheduler with batch size 24. We also used the quality rating, which is meta information on audio quality: while computing the loss, we weight it using the normalized quality rating. In addition, we used one-sided label smoothing, adding 0.01 across all negative labels. Both methods were proposed by Henkel et al. [22] and improved our performance consistently.
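A sketch of this loss, under our assumptions, is shown below; the focal-loss gamma and the exact quality normalization are assumed values, as they are not reported above.

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits, targets, quality, gamma=2.0, smoothing=0.01):
    """Focal binary cross-entropy with one-sided label smoothing and
    quality-rating weighting (a sketch; gamma and the quality normalization
    are assumptions). logits/targets: (batch, n_classes); quality: (batch,)."""
    # One-sided label smoothing: negative targets become 0.01, positives stay 1.
    targets = torch.clamp(targets, min=smoothing)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    focal = (1.0 - p_t) ** gamma * bce
    # Weight each clip's loss by its normalized quality rating (assumed 0-5 scale).
    weights = quality.float() / 5.0
    return (focal.mean(dim=1) * weights).mean()
```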

2.5. Post-Processing
2.5.1. Penalization
We observed that our models are biased and tend to predict with high confidence scores the most represented birds (i.e., skylar and houfin). This implies a large number of False Positives (FPs) and misclassified clips. We give a penalty score depending on the distribution of the birds, such that the most represented birds are penalized more. Penalization (PN) can be expressed as in Eq. 1, where 𝑥𝑖 is the number of training clips of bird 𝑖. We used a penalty factor of 0.8. Penalization is not a realistic technique: the model should not filter out bird species in that way; yet, in this scenario, it boosted our score on the public LB. As a better alternative, aiming to make our method less sensitive to the data distribution and robust against background birdcalls from non-scored birds, we searched for class-wise thresholds for each bird species.
$$ p_i = p_i - \text{penalty factor} \times \frac{x_i}{\sum_{j=1}^{n} x_j} \qquad (1) $$
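A short sketch of this penalization step, applied to the predicted probabilities, could look as follows (probs and train_counts are hypothetical array names):

```python
import numpy as np

def penalize(probs: np.ndarray, train_counts: np.ndarray,
             penalty_factor: float = 0.8) -> np.ndarray:
    """Apply Eq. 1: lower the scores of well-represented species in
    proportion to their share of the training data.
    probs: (segments, classes); train_counts: (classes,) clip counts."""
    share = train_counts / train_counts.sum()
    return probs - penalty_factor * share
```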

2.5.2. Class-Wise Thresholds
We observed that if there is a birdcall in an audio clip, regardless of the bird species, our models output higher confidence scores. We used the train_soundscape audio clips to validate nocall thresholds for bird species using the AUC score per bird (treating call/nocall as a binary classification problem). Even though there are no labels for scored birds in train_soundscape, we can estimate the appropriate nocall/birdcall thresholds for each bird. We used a grid search to find the best nocall quantile-based threshold for each bird, such that we achieve the maximum AUC score per bird; in other words, such that we can better distinguish birdcalls from noise or background, independently of the bird species present in the audio. Figure 5 shows the distribution of probabilities using train_soundscapes. This class-wise (CW) post-processing method further boosted our score in comparison to Penalization and is more robust.
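The sketch below illustrates one way to run this quantile grid search, under the assumption that the binarized per-bird predictions are scored with AUC against the call/nocall labels of the train_soundscapes segments; variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def class_wise_thresholds(probs: np.ndarray, has_call: np.ndarray,
                          quantiles=np.linspace(0.50, 0.99, 50)) -> np.ndarray:
    """Grid-search a quantile-based nocall threshold per bird.
    probs: (segments, classes) predicted probabilities on train_soundscapes;
    has_call: (segments,) binary call/nocall labels (any species)."""
    thresholds = np.zeros(probs.shape[1])
    for c in range(probs.shape[1]):
        best_auc, best_t = -1.0, np.quantile(probs[:, c], 0.5)
        for q in quantiles:
            t = np.quantile(probs[:, c], q)
            auc = roc_auc_score(has_call, (probs[:, c] >= t).astype(float))
            if auc > best_auc:
                best_auc, best_t = auc, t
        thresholds[c] = best_t
    return thresholds
```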




Figure 5: Distributions of nocall probability validated using train_soundscapes. We show the class-wise
best quantile thresholds to obtain the maximum AUC score per bird.
3. Results and discussion
Table 1 summarizes our experiments. We tested 9 different CNN backbones. We found it difficult to calibrate thresholds using an ensemble of models; nevertheless, we used quantile-based thresholds on the ensemble predictions. Penalization (PN) showed good performance on the public LB. However, penalizing common birds such as skylar or houfin, which most probably appear in most of the audios, is not realistic. On the other hand, the class-wise (CW) method showed better performance in general, and it is robust to background birds. We find that calibration of thresholds is very sensitive because rare birds appear in very few audio clips, both in the dataset and in the real world. Our results imply that we can find proper thresholds for each rare bird using nocall/birdcall validation and a quantile-based approach, without strongly labeled data, as we show in Figure 5.
   We also provide qualitative Grad-CAM [37] results of our model tf_efficientnet_b0_ns in Figure 6, which show how our model is able to learn and focus on particular frequencies and segments within the audio, and that it is robust against background noise.

Table 1
Experimental results of our models. For local validation, we used the "micro F1-score" and the train soundscapes. We highlight in blue our top-3 models, in yellow the results of our final submission ensemble, and in green the top solutions on the challenge LB. We also distinguish two post-processing methods: PN and CW. Contemporary results from other competitors can be found in [38, 39, 40, 41].
                       Backbone               CV      Public LB    Private LB    Post-Proc.
            tf_efficientnet_b0_ns [13]      0.8745     0.7922        0.7240         PN
            tf_efficientnet_b0_ns [13]      0.8745     0.7817        0.7548         CW
                  eca_nfnet_l0 [42]         0.8761     0.7510        0.7387         CW
                   resnest50d [12]          0.8822     0.7550        0.7372         CW
             tf_efficientnet_b1_ns [13]     0.7843     0.7395        0.6946         CW
             tf_efficientnet_b2_ns [13]     0.8848     0.7277        0.7046         CW
             tf_efficientnet_b3_ns [13]     0.8561     0.7262        0.6640         CW
           tf_efficientnetv2_s_in21k [43]   0.8632     0.7620        0.7439         CW
              tf_efficientnetv2_b0 [43]     0.8762     0.7268        0.7070         CW
                   Ours Ensemble               -       0.7971        0.7733         CW
                   Ours Ensemble               -       0.8359        0.7630         PN
                       1st Place               -       0.8953        0.8527
                       2nd Place               -       0.9128        0.8438
                       3rd Place               -       0.8750        0.8126
                     BirdNet [44]              -        0.85          0.78



4. Conclusion
We hope our work can help researchers and conservation practitioners accurately survey
population trends, so they can regularly and more effectively evaluate threats. We present a
sound detection and classification pipeline for analyzing soundscape recordings. Our models
learn from few data and weak labels; they can accurately classify fine-grained bird vocalizations
in 0.04s using a single GPU. Moreover, they show robustness against noisy sounds (e.g., rain,
cars). We aim to improve the model’s efficiency for smartphone applications.
Figure 6: Grad-CAM [37] activations from our model tf_efficientnet_b0_ns on different validation
audio spectrograms. These qualitative results show how our model focuses on particular frequencies
through time to recognize the birds. Best viewed in electronic version.
Acknowledgments
Marcos Conde is supported by H2O.ai and by Humboldt Foundation (JMU Würzburg).
We would like to thank Kaggle and Dr. Stefan Kahl for hosting the BirdCLEF 2022 Challenge.
We also want to thank the contributions from: Amanda K. Navine, Ann Tanimoto-Johnson,
Hidehisa Arai, Christof Henkel, Pascal Pfeiffer, and Philipp Singer.


References
 [1] S. Kahl, A. Navine, T. Denton, H. Klinck, P. Hart, H. Glotin, H. Goëau, W.-P. Vellinga,
     R. Planqué, A. Joly, Overview of birdclef 2022: Endangered bird species recognition
     in soundscape recordings, Working Notes of CLEF 2022 - Conference and Labs of the
     Evaluation Forum (2022).
 [2] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso,
     I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel,
     P. Bonnet, M. Šulc, H. Müller, Overview of lifeclef 2022: an evaluation of machine-
     learning based species identification and species distribution prediction, in: International
     Conference of the Cross-Language Evaluation Forum for European Languages, Springer,
     2022.
 [3] S. Kahl, Kaggle, Birdclef 2022, https://www.kaggle.com/competitions/birdclef-2022/, 2022.
     Accessed: 2022-06-03.
 [4] S. Kahl, T. Denton, H. Klinck, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly,
     Overview of BirdCLEF 2021: Bird call identification in soundscape recordings, in: Working
     Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021, pp. 1437–1450.
 [5] S. Kahl, M. Clapp, W. Hopping, H. Goëau, H. Glotin, R. Planqué, W.-P. Vellinga, A. Joly,
     Overview of BirdCLEF 2020: Bird Sound Recognition in Complex Acoustic Environments,
     in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, 2020.
 [6] E. Sprengel, M. Jaggi, Y. Kilcher, T. Hofmann, Audio based bird species identification using
     deep learning techniques, in: CLEF, 2016.
 [7] M. Lasseck, Bird species identification in soundscapes, in: CLEF, 2019.
 [8] M. Lasseck, Audio-based bird species identification with deep convolutional neural net-
     works, in: CLEF, 2018.
 [9] J. Schlüter, Bird identification from timestamped, geotagged audio recordings, in: CLEF,
     2018.
[10] J. Bai, C. Chen, J. Chen, Xception based method for bird sound recognition of birdclef 2020,
     in: CLEF, 2020.
[11] M. Mühling, J. Franz, N. Korfhage, B. Freisleben, Bird species recognition via neural
     architecture search, in: CLEF, 2020.
[12] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha,
     M. Li, A. Smola, Resnest: Split-attention networks, 2020. arXiv:2004.08955.
[13] M. Tan, Q. V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks,
     2020. arXiv:1905.11946.
[14] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, Panns: Large-scale pretrained
     audio neural networks for audio pattern recognition, 2020. arXiv:1912.10211.
[15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015.
     arXiv:1512.03385.
[16] M. V. Conde, K. Turgutlu, Exploring vision transformers for fine-grained classification,
     arXiv preprint arXiv:2106.10587 (2021).
[17] K. Drossos, S. I. Mimilakis, S. Gharib, Y. Li, T. Virtanen, Sound event detection with
     depthwise separable and dilated convolutions, in: 2020 International Joint Conference on
     Neural Networks (IJCNN), IEEE, 2020, pp. 1–7.
[18] E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, X. Serra, Learning sound event
     classifiers from web audio with noisy labels, 2019. arXiv:1901.01189.
[19] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. Bello, Robust sound event detection
     in bioacoustic sensor networks, PLOS ONE 14 (2019) e0214168. doi:10.1371/journal.
     pone.0214168.
[20] L. Zhang, S. Nizampatnam, A. Gangopadhyay, M. V. Conde, Multi-attention networks for
     temporal localization of video-level labels, arXiv preprint arXiv:1911.06866 (2019).
[21] N. Murakami, H. Tanaka, M. Nishimori, Birdcall identification using CNN and gradient
     boosting decision trees with weak and noisy supervision, in: Proceedings of the Working
     Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania,
     September 21st - to - 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org,
     2021, pp. 1597–1608. URL: http://ceur-ws.org/Vol-2936/paper-136.pdf.
[22] C. Henkel, P. Pfeiffer, P. Singer, Recognizing bird species in diverse soundscapes under
     weak supervision, 2021. URL: https://arxiv.org/abs/2107.07728. doi:10.48550/ARXIV.
     2107.07728.
[23] M. V. Conde, K. Shubham, P. Agnihotri, N. D. Movva, S. Bessenyei, Weakly-supervised
     classification and detection of bird sounds in the wild. a birdclef 2021 solution, in: CLEF,
     CEUR-WS.org, 2021, pp. 1547–1558. URL: http://ceur-ws.org/Vol-2936/paper-131.pdf.
[24] G. Das, S. Aggarwal, Bird-species audio identification, ensembling 1d + 2d signals, in:
     Proceedings of the Working Notes of CLEF 2021, 2021.
[25] A. S. Kumar, D. Kowerko, Tuc media computing at birdclef 2021: Noise augmentation
     strategies in bird sound classification in combination with densenets and resnets, in: Pro-
     ceedings of the Working Notes of CLEF 2021, volume 2936 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2021, pp. 1617–1626. URL: http://ceur-ws.org/Vol-2936/paper-138.pdf.
[26] M. V. Shugaev, N. Tanahashi, P. Dhingra, U. Patel, Birdclef 2021: building a birdcall
     segmentation model based on weak labels, in: Proceedings of the Working Notes of CLEF
     2021, 2021.
[27] J. Schlüter, Learning to monitor birdcalls from weakly-labeled focused recordings, in:
     G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working
     Notes of CLEF 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp.
     1627–1638. URL: http://ceur-ws.org/Vol-2936/paper-139.pdf.
[28] J.-F. Puget, Stft transformers for bird song recognition, in: CLEF, CEUR-WS.org, 2021, pp.
     1609–1616. URL: http://ceur-ws.org/Vol-2936/paper-137.pdf.
[29] T. Weyand, A. Araujo, B. Cao, J. Sim, Google landmarks dataset v2 – a large-scale bench-
     mark for instance-level recognition and retrieval, 2020. URL: https://arxiv.org/abs/2004.
     01804. doi:10.48550/ARXIV.2004.01804.
[30] D. Stowell, M. D. Plumbley, An open dataset for research on audio field recording archives:
     freefield1010, 2013. URL: https://arxiv.org/abs/1309.5275. doi:10.48550/ARXIV.1309.
     5275.
[31] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. P. Bello, BirdVox-DCASE-20k: a
     dataset for bird audio detection in 10-second clips, 2018. URL: https://doi.org/10.5281/
     zenodo.1208080. doi:10.5281/zenodo.1208080.
[32] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimiza-
     tion, 2017. URL: https://arxiv.org/abs/1710.09412. doi:10.48550/ARXIV.1710.09412.
[33] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, SpecAug-
     ment: A Simple Data Augmentation Method for Automatic Speech Recognition, in: Proc.
     Interspeech 2019, 2019, pp. 2613–2617. doi:10.21437/Interspeech.2019-2680.
[34] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, K. Takeda, Conformer-based
     sound event detection with semi-supervised learning and data augmentation, DCASE2020
     Workshop (2020).
[35] H. Nam, S.-H. Kim, B.-Y. Ko, Y.-H. Park, Frequency dynamic convolution: Frequency-
     adaptive pattern recognition for sound event detection, 2022. URL: https://arxiv.org/abs/
     2203.15296. doi:10.48550/ARXIV.2203.15296.
[36] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, S. Dubnov, Hts-at: A hierarchical
     token-semantic audio transformer for sound classification and detection, 2022. URL: https:
     //arxiv.org/abs/2202.00874. doi:10.48550/ARXIV.2202.00874.
[37] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual
     explanations from deep networks via gradient-based localization, in: Proceedings of the
     IEEE international conference on computer vision, 2017, pp. 618–626.
[38] S. Krishnan, P. Khandelwal, R. Garg, Bird Species Classification: One Step at a Time, in:
     CLEF Working Notes 2022, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2022,
     Bologna, Italy, 2022.
[39] E. Martynov, Y. Uematsu, Dealing with Class Imbalance in Bird Sound Classification, in:
     CLEF Working Notes 2022, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2022,
     Bologna, Italy, 2022.
[40] A. Miyaguchi, J. Yu, B. Cheungvivatpant, D. Dudley, A. Swain, Motif Mining and Unsuper-
     vised Representation Learning for BirdCLEF 2022, in: CLEF Working Notes 2022, CLEF:
     Conference and Labs of the Evaluation Forum, Sep. 2022, Bologna, Italy, 2022.
[41] A. Sampathkumar, D. Kowerko, TUC Media Computing at BirdCLEF 2022: Strategies in
     identifying bird sounds in a complex acoustic environment, in: CLEF Working Notes 2022,
     CLEF: Conference and Labs of the Evaluation Forum, Sep. 2022, Bologna, Italy, 2022.
[42] A. Brock, S. De, S. L. Smith, K. Simonyan, High-performance large-scale image recognition
     without normalization, in: International Conference on Machine Learning, PMLR, 2021,
     pp. 1059–1071.
[43] M. Tan, Q. Le, Efficientnetv2: Smaller models and faster training, in: International
     Conference on Machine Learning, PMLR, 2021, pp. 10096–10106.
[44] S. Kahl, C. M. Wood, M. Eibl, H. Klinck, Birdnet: A deep learning solution for avian
     diversity monitoring, Ecological Informatics 61 (2021) 101236.