Dealing with Class Imbalance in Bird Sound Classification

Eduard Martynov1, Yuuichiroh Uematsu2
1 Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow, 119991, Russian Federation
2 Ricoh Company, Ltd., 2-7-1, Izumi, Ebina-shi, Kanagawa 243-0460, Japan

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
mart.eduard67@gmail.com (E. Martynov); yuuichiroh.uematsu@jp.ricoh.com (Y. Uematsu)
ORCID: 0000-0002-2122-0024 (E. Martynov)

Abstract
Recent achievements in machine learning have made it possible to build fully autonomous bird sound detection pipelines; however, these pipelines usually suffer from weak performance on underrepresented classes. We address this issue with an approach that combines custom Convolutional Neural Networks and Pretrained Audio Neural Networks (PANNs). During training, we leverage pseudo labels as well as hand labels for small classes. Moreover, we distribute classes between models and train them with different loss functions. Our solution achieved third place on the private leaderboard of the BirdCLEF 2022 challenge.

Keywords
Convolutional Neural Networks, Audio classification, BirdCLEF 2022, Sound Event Detection, Pretrained Audio Neural Networks, Computer Vision, CEUR-WS

1. Introduction

The BirdCLEF challenge plays an important role in developing biodiversity monitoring methods throughout the world. Previous BirdCLEF challenges [1, 2] used micro-averaged metrics such as micro F1 and class-wise mAP. This allowed participants to focus on improving top-line evaluation statistics for a common core set of species with large amounts of data, while species with little available data remained largely uninvestigated. The aim of the BirdCLEF 2022 challenge [3, 4] is to improve detection performance for bird calls for which audio samples are hard to acquire.

Participants were given a dataset of 15,182 training samples covering 152 classes, of which only 21 were scored. Some classes had only a single sample; the non-scored classes were included to enrich the dataset with additional audio containing bird calls. Each training sample has associated primary and secondary labels. The primary-label bird can usually be heard very clearly in the first and last 5 seconds of the audio clip; the birds from the secondary labels can be heard anywhere in the audio.

The test set is almost ten times larger than in the previous competition and contains 5,500 soundscapes of approximately one minute each. For each 5-second segment of every test audio, participants were asked to predict whether each of the 21 scored birds can be heard or not.

In our solution, we use the custom CNN model proposed in [5] and the PANN-like Sound Event Detection (SED) model proposed in [6], both of which have shown good performance throughout the history of the BirdCLEF competition. However, custom training techniques were required to achieve good performance; we discuss them below.

2. Dataset preparation

All of our models take 2D mel spectrograms as input. We used the mel spectrogram transform implemented in the torchaudio [7] library to convert raw audio into 2D images. This implementation converts audio clips directly on the GPU, which boosted training speed by a factor of 10 compared to a similar approach with librosa [8], which performs the conversion on the CPU.
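As a rough illustration of this step, the transform can be built once and kept on the GPU; the spectrogram parameters below (sample rate, n_fft, hop_length, n_mels) are illustrative placeholders, not the exact values from our pipeline.

```python
import torch
import torchaudio

SAMPLE_RATE = 32000  # assumed value, for illustration only

# Keeping the transforms on the GPU avoids the CPU round-trip of a librosa-based pipeline.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=2048,
    hop_length=512,
    n_mels=128,
).to("cuda")
db_transform = torchaudio.transforms.AmplitudeToDB().to("cuda")

def wave_to_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a batch of raw waveforms (batch, samples) into log-mel images."""
    waveform = waveform.to("cuda", non_blocking=True)
    mel = mel_transform(waveform)   # (batch, n_mels, time)
    return db_transform(mel)        # log-scaled spectrogram fed to the 2D models
```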
3. Model architecture

3.1. Convolutional Neural Network

One of the models we used was the custom CNN proposed in [5]. This model achieved second place in the BirdCLEF 2021 challenge, so it is a strong baseline; we decided to use it not only because of its performance but also because of the nature of its architecture.

We train this model on random 30-second crops; input audio shorter than 30 seconds is padded with zeros. As shown in Figure 1, before feeding mel spectrograms to the backbone, we reshape each 30-second crop into 6 equal 5-second parts, which essentially limits the receptive field of the network to 5-second crops. We think this architectural design allows the network to generalize to 5-second crops as well. As a loss function we use BCE loss, and as targets we use the union of primary and secondary labels. We believe that precise localization of birds is not necessary for this model, since random 30-second crops almost always contain the target signal.

Figure 1: CNN training pipeline proposed in [5]. The input sample is a random 30-second crop from the original audio.

To select models on validation, BCE loss was used as a metric. For a given audio clip from the validation set, we crop the first 30 seconds and use it as input to the model. At inference time we use 5-second crops directly.

3.2. Pre-trained audio neural network

Additionally, our solution used a PANN-based [6] model architecture, which achieved high accuracy in last year's competition. We used several versions of this model that differ in their backbone.

These models were trained in two stages. In the first stage, we split the dataset into 4 folds and train models on random 10–15 second crops in a cross-validation manner. Each of the 4 models learned the underlying structure of the data and the ability to distinguish between different bird calls. However, since we used random crops during the first stage, the model did not always receive proper gradients: a random crop can contain no bird call at all, in which case the label provided to the model is wrong. This introduces unwanted label noise, which can be mitigated with pseudo labels.

To fight these weak labels, in the second stage we used pseudo labels obtained from predictions on out-of-fold data. For a given audio clip, we selected all segments that contained bird calls according to the predictions of the first-stage model. We also dropped secondary labels if our model was not confident enough about them. Finally, we zeroed out any pseudo-label probabilities corresponding to birds that were not in the union of the primary and secondary labels of the given audio clip. We think pseudo labels are crucial for the good performance of SED models, since the input samples are 10–15 seconds long, and we believe training without them would hurt these models. We then re-train the models on the pseudo labels, using BCE loss for some models and Focal loss [9] for others. Checkpoint selection was based on the macro F1 metric in both stages.
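The pseudo-labelling step can be sketched as follows; the thresholds, the helper name and the exact filtering rules are illustrative assumptions rather than the precise values used in our training code.

```python
import numpy as np

def build_pseudo_labels(oof_probs, primary, secondary,
                        call_thr=0.5, secondary_thr=0.3):
    """Turn out-of-fold predictions for one clip into pseudo labels.

    oof_probs: array of shape (num_segments, num_classes) with first-stage
    predictions; primary/secondary: sets of class indices from the metadata.
    Thresholds are placeholders chosen for illustration.
    """
    allowed = set(primary) | set(secondary)
    pseudo = oof_probs.copy()

    # Zero out birds outside the union of primary and secondary labels.
    mask = np.zeros(pseudo.shape[1], dtype=bool)
    mask[list(allowed)] = True
    pseudo[:, ~mask] = 0.0

    # Drop secondary labels the first-stage model is not confident about.
    for bird in secondary:
        if pseudo[:, bird].max() < secondary_thr:
            pseudo[:, bird] = 0.0

    # Keep only segments where the model predicts a call for some allowed bird.
    keep = pseudo.max(axis=1) >= call_thr
    return pseudo[keep], keep
```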
4. Augmentations

The gap between training and test data is large: the training data is a collection of human-recorded bird audio from the xeno-canto web library [10], while the test data is a set of automatically recorded soundscapes. To fight this domain shift and make sure our models generalize well to unseen data, we use the following augmentations:

• Mixup [11]. We applied mixup with probability 1.0 and alpha = 1.0. This augmentation stabilized training and, combined with a cosine annealing learning rate schedule, allowed models to converge even when trained on the whole training dataset.
• Cutmix [12]. We also applied cutmix to further improve stability. Applied after mixup, it did not introduce any noticeable changes.
• Background noise. To simulate noisy natural environments, we added background noise to all audio samples at different SNRs. To avoid accidentally adding noise that itself contains bird calls, we selected no-call samples from freefield1010 [13], BirdVox-DCASE-20k [14] and the previous year's challenge data and used them as background noise.
• Random power. We randomly raised mel spectrograms to a power varying from 0.5 to 3.
• SpecAugment [15]. We randomly dropped 2 time stripes and 8 frequency stripes from the mel spectrograms during training to enrich the training dataset.
• Gaussian SNR. We also added Gaussian noise to further improve the generalization of the model.

5. Oversampling and hand labels

Adding hand-crafted labels was one of our approaches to enhancing performance on underrepresented classes. First, we manually extracted segments in which the target bird is singing and then split the audio to increase the number of training samples. This work took 4–5 hours, as it was done only for the small classes. Second, we increased the number of training samples of a class to 10 by random oversampling, provided that the class had fewer than 10 samples. This number was chosen to balance underfitting against overfitting to certain classes during training.

Figure 2: SED advantage over CNN. The CNN can sometimes miss samples from small classes, whereas the SED model is still able to detect them.

Figure 3: CNN advantage over SED. The CNN captures information about well-represented classes better than the SED model.

6. Ensembling

6.1. Bird split

During the ensembling stage we found that simply averaging the predictions of the SED model trained with Focal loss and the CNN trained with BCE loss does not improve the competition metric. To understand why, we inspected the probability distributions of the predictions on both target and non-target data for every bird. The SED model tended to make more conservative predictions; however, it did not miss bird calls belonging to underrepresented classes, as can be seen in Figure 2. On the other hand, the CNN model was always confident and separates the data very well for large classes, predicting probabilities close to 0 for non-target data and close to 1 for target data. This allowed the CNN model to reduce the number of false positive detections while still assigning high probabilities to actual bird calls (Figure 3).

To address this, we split the birds into two groups. The first, which we call Group One, contains 7 underrepresented birds: 'crehon', 'ercfra', 'hawgoo', 'hawhaw', 'hawpet1', 'maupar', 'puaioh'. The second group, Group Two, contains the other 14 scored birds. In the final ensemble we ended up using SED models trained with Focal loss to predict birds from Group One, while for the birds from Group Two we used CNN models and SED models, both trained with BCE loss.
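A minimal sketch of this per-group combination is shown below; the dictionary-based interface and the equal-weight averaging of the BCE-trained models are assumptions made for illustration.

```python
# The seven underrepresented birds handled by the Focal-loss SED models (Group One).
GROUP_ONE = {"crehon", "ercfra", "hawgoo", "hawhaw", "hawpet1", "maupar", "puaioh"}

def ensemble_by_group(sed_focal, sed_bce, cnn_bce, scored_birds):
    """Combine per-bird probabilities according to the bird split described above.

    Each of sed_focal, sed_bce and cnn_bce maps a bird name to an array of
    per-segment probabilities for that bird.
    """
    final = {}
    for bird in scored_birds:
        if bird in GROUP_ONE:
            # Underrepresented birds: rely on the SED models trained with Focal loss.
            final[bird] = sed_focal[bird]
        else:
            # Well-represented birds: blend the CNN and SED models trained with BCE loss.
            final[bird] = 0.5 * cnn_bce[bird] + 0.5 * sed_bce[bird]
    return final
```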
6.2. Postprocessing

We applied the so-called time-smoothing post-processing from [5] to the probabilities obtained for the birds from Group Two. It is essentially a sliding-window weighted average of probabilities applied along the time axis, and it can be seen as a soft way of lowering the model's thresholds, since only probabilities with high-scoring neighbours receive a considerable gain. This post-processing recovers missed true positive detections while introducing only a negligible number of new false positives.
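A sketch of such a sliding-window average is given below; the window weights are an illustrative choice rather than the exact values used in [5] or in our submission.

```python
import numpy as np

def time_smooth(probs, weights=(0.25, 0.5, 0.25)):
    """Weighted average of each 5-second segment with its temporal neighbours.

    probs: array of shape (num_segments, num_classes) for one soundscape;
    weights: (previous, current, next) window weights, assumed here.
    """
    prev_w, cur_w, next_w = weights
    padded = np.pad(probs, ((1, 1), (0, 0)), mode="edge")  # repeat edge segments
    return (prev_w * padded[:-2]     # previous segment
            + cur_w * padded[1:-1]   # current segment
            + next_w * padded[2:])   # next segment
```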
6.3. Threshold selection

Since the BirdCLEF 2022 metric is threshold-dependent, we had to select the thresholds for all birds carefully, using several properties of this year's dataset:

• The probability distribution of out-of-fold predictions on non-target data under Focal loss differs from bird to bird, so we set the threshold for each bird from Group One based on this distribution, adopting the 91st percentile of the distribution as the threshold.
• For birds from Group Two, we set the threshold to 0.05, except for 'skylar', whose threshold was set to 0.35, since validation made it clear that our models recognize this bird very well. We found that a threshold of 0.05 worked best for our models on the public leaderboard.

7. Results

For our best submission, we used an ensemble of CNN and SED models with the bird split between models described above. We show our results in Table 1. Individual models performed competitively, but the ensemble was necessary to reach third place.

Table 1
Model results. This table highlights the performance of different ensembles of the models.

Models                                                     Public LB   Private LB
CNN model without augmentations                            0.7715      0.7278
CNN model with augmentations                               0.7761      0.7359
Best CNN ensemble                                          0.8327      0.7898
SED model (4 folds)                                        0.8339      0.7823
SED + CNN ensemble using two groups of birds               0.8532      0.8052
Add SED trained with BCE loss to birds from Group Two      0.8750      0.8126
Lower the thresholds for Group Two birds (0.05 -> 0.03)    0.8707      0.8274

8. Conclusion and future work

During this challenge we explored various models and found that training strategies had to be chosen carefully to obtain the best performance. Techniques such as pseudo labeling, oversampling and hand labeling were tested and their impact verified, along with careful ensembling of the various models. Moreover, during training we used the BirdCLEF 2022 competition dataset together with no-call samples from freefield1010 [13], BirdVox-DCASE-20k [14] and the previous year's challenge data; each of these datasets was a good source of background audio samples. As future work, we would like to inspect the impact of the random-crop length as well as the impact of pseudo labels on the CNN models, since we think smaller crops can benefit the model during training and make the signal easier to learn; however, it is harder to acquire correct training samples in that case.

References

[1] S. Kahl, T. Denton, H. Klinck, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2021: Bird call identification in soundscape recordings, 2021, pp. 1437–1450. URL: http://ceur-ws.org/Vol-2936/paper-123.pdf.
[2] S. Kahl, M. Clapp, W. A. Hopping, H. Goëau, H. Glotin, R. Planqué, W.-P. Vellinga, A. Joly, Overview of BirdCLEF 2020: Bird sound recognition in complex acoustic environments, 2020. URL: http://ceur-ws.org/Vol-2696/paper-262.pdf.
[3] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel, P. Bonnet, M. Šulc, H. Müller, Overview of LifeCLEF 2022: An evaluation of machine-learning based species identification and species distribution prediction, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2022.
[4] S. Kahl, A. Navine, T. Denton, H. Klinck, P. Hart, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2022: Endangered bird species recognition in soundscape recordings, Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum (2022).
[5] C. Henkel, P. Pfeiffer, P. Singer, Recognizing bird species in diverse soundscapes under weak supervision, 2021. URL: https://arxiv.org/abs/2107.07728. doi:10.48550/ARXIV.2107.07728.
[6] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-scale pretrained audio neural networks for audio pattern recognition, 2019. URL: https://arxiv.org/abs/1912.10211. doi:10.48550/ARXIV.1912.10211.
[7] Y.-Y. Yang, M. Hira, Z. Ni, A. Chourdia, A. Astafurov, C. Chen, C.-F. Yeh, C. Puhrsch, D. Pollack, D. Genzel, D. Greenberg, E. Z. Yang, J. Lian, J. Mahadeokar, J. Hwang, J. Chen, P. Goldsborough, P. Roy, S. Narenthiran, S. Watanabe, S. Chintala, V. Quenneville-Bélair, Y. Shi, TorchAudio: Building blocks for audio and speech processing, arXiv preprint arXiv:2110.15018 (2021).
[8] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: Audio and music signal analysis in Python, in: Proceedings of the 14th Python in Science Conference, volume 8, 2015.
[9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, 2017. URL: https://arxiv.org/abs/1708.02002. doi:10.48550/ARXIV.1708.02002.
[10] Xeno-canto, Sharing bird sounds from around the world, 2022. URL: xeno-canto.org.
[11] H. Zhang, M. Cissé, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, CoRR abs/1710.09412 (2017). URL: http://arxiv.org/abs/1710.09412. arXiv:1710.09412.
[12] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, CutMix: Regularization strategy to train strong classifiers with localizable features, CoRR abs/1905.04899 (2019). URL: http://arxiv.org/abs/1905.04899. arXiv:1905.04899.
[13] D. Stowell, M. D. Plumbley, freefield1010 - an open dataset for research on audio field recording archives, in: Proceedings of the Audio Engineering Society 53rd Conference on Semantic Audio (AES53), Audio Engineering Society, 2014.
[14] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. Bello, BirdVox-full-night: A dataset and benchmark for avian flight call detection, in: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Institute of Electrical and Electronics Engineers Inc., 2018, pp. 266–270. doi:10.1109/ICASSP.2018.8461410.
[15] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, SpecAugment: A simple data augmentation method for automatic speech recognition, in: Interspeech 2019, ISCA, 2019. URL: https://doi.org/10.21437/interspeech.2019-2680. doi:10.21437/interspeech.2019-2680.