Methods for Training Convolutional Neural Networks to Identify Bird Species in Complex Soundscape Recordings

Konstantin Dmitriev1,*

1 Lomonosov Moscow State University, 1 Leninskie Gory, Moscow, 119992, Russian Federation

Abstract
Bird species identification is an important task in ecosystem monitoring. Modern methods based on deep learning allow such research to be carried out cheaply and on a regular basis. However, creating such algorithms is not easy due to the wide variety of birds, their calls, recording conditions, and equipment used. In this paper, several methods are presented for training Convolutional Neural Networks (CNNs) that improve the effectiveness of these models. They include recording length standardization, data augmentation, mixing, sample selection, and weighting.

Keywords
Audio classification, sound event detection, signal processing, convolutional neural network, augmentations, spectrogram

1. Introduction

Bird species diversity and its change over time serve as good indicators of ecosystem state. Traditional monitoring methods require the presence of a qualified observer who can identify birds manually. It is hard and expensive to conduct such surveys regularly, especially over large areas. Many birds are small and hard to notice, but their calls are loud. So, it seems promising to replace human observers with small and cheap audio recording devices with omnidirectional microphones and to process the recordings using modern methods based on deep learning.

Creating and training such algorithms is a difficult task, however. The first problem is the diversity of bird species and their recording conditions. Many birds can imitate other birds or even repeat a sound they once heard and liked, and many animals and insects sound like birds. The second problem is the difference between the available training data and the real recordings to be processed.
Usually, the training data is a set of short bird call recordings made by different people at different locations. They try to make the recordings clean and loud, without noise or interference: good equipment is used, including directional microphones, and bad recordings are discarded. The third problem is that only weak labels are given, which indicate a bird's presence in each recording but not the exact call position.

BirdCLEF 2024 is a competition intended to address the mentioned problems [1, 2]. It is a part of the LifeCLEF 2024 conference [3]. The task is to identify bird calls in a set of recordings made in the Western Ghats, India. The training dataset consists of recordings from the xeno-canto project [4]. Each of them has primary and secondary labels: the primary label corresponds to the main bird that can be heard, and the secondary labels mark additional birds that can accompany it. The organizers selected 182 bird species whose presence needs to be predicted. The predictions must be made for each 5-second interval of about 1100 recordings that form the hidden dataset. An additional constraint is that the task must be completed in 2 hours using CPU only. The macro-averaged AUC ROC that discards classes with no true labels was used as the competition metric. To prevent overfitting, the full hidden dataset is split into public and private parts (approximately 35% and 65% of the data, respectively). The corresponding public and private scores are calculated independently, and only the public score is known during the competition.

This article presents methods that can be used to overcome the aforementioned difficulties and improve the results of bird call recognition.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
presentatio@mail.ru (K. Dmitriev)
ORCID: 0000-0001-5842-383X (K. Dmitriev)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Methods

2.1. Model architecture

The model used is based on the model proposed in [5]. Its scheme is presented in Fig. 1. After the signal is loaded and normalized, the spectrogram transform is performed. The spectrogram is fed to a backbone CNN, followed by B multi-head attention blocks. Finally, a log-sum-exp pooling layer is used to extract the label. The multi-head attention mechanism is described in [6] and is implemented in PyTorch by the torch.nn.MultiheadAttention class. Its main parameters are the number H of parallel attention heads in each block and the embedding dimension D.

Figure 1: Model architecture.

To find the best combination of model parameters, a number of tests were conducted. Instead of the macro-averaged AUC ROC score, which became very close to one after a few epochs, the accuracy score was used in a 5-fold cross-validation (CV) scheme. In the tests, the backbone as well as the values of B, H, and D were varied. The results are presented in Table 1.

Table 1
The CV accuracy score calculated using xeno-canto data for models with different parameters: the backbone, the embedding dimension D, the number H of attention heads, and the number B of multi-head attention blocks.

resnet18, H = 8, B = 2
D       Accuracy
256     0.755
512     0.748
768     0.741
1024    0.661
1536    0.650
2048    0.520

resnet18, D = 768, B = 2
H       Accuracy
1       0.651
2       0.693
3       0.703
8       0.741
16      0.734
32      0.714
64      0.779

resnet18, D = 768, H = 8
B       Accuracy
0       0.604
1       0.697
2       0.741
3       0.721
4       0.611

D = 768, H = 8, B = 2
Backbone              Accuracy
resnet18              0.741
seresnext26t_32x4d    0.852
seresnext50_32x4d     0.849
eca_nfnet_l0          0.679
efficientnet_b0       0.695
resnext50_32x4d       0.785
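The architecture described in Section 2.1 can be sketched in PyTorch as follows. This is a minimal illustration, not the competition code: the stand-in backbone, the frequency-pooling step, and all layer sizes are assumptions; only the backbone → multi-head attention blocks → log-sum-exp pooling skeleton follows the paper.

```python
import math

import torch
import torch.nn as nn


class BirdCallModel(nn.Module):
    """Sketch of the Fig. 1 pipeline: spectrogram -> CNN backbone ->
    B multi-head attention blocks -> log-sum-exp pooling over time."""

    def __init__(self, n_classes=182, dim=768, heads=8, blocks=2):
        super().__init__()
        # Stand-in backbone: any CNN mapping a spectrogram to a
        # (batch, dim, time) feature map would do; the paper used
        # timm backbones such as seresnext26t_32x4d.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis
        )
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(blocks)
        )
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                # spec: (batch, 1, mel, time)
        x = self.backbone(spec).squeeze(2)  # (batch, dim, time)
        x = x.transpose(1, 2)               # (batch, time, dim)
        for attn in self.attn:
            x, _ = attn(x, x, x)            # self-attention block
        logits = self.head(x)               # per-frame class logits
        # Log-sum-exp pooling aggregates frame-wise evidence into one
        # clip-level logit per class (normalized by the frame count).
        return torch.logsumexp(logits, dim=1) - math.log(logits.shape[1])
```

With D divisible by H, the model maps a batch of spectrograms to one logit vector per clip, matching the weak (clip-level) labels available for training.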
The simple resnet18 [7] backbone was used to check the different parameters. With it, the best results were achieved with D = 256, H = 64, and B = 2. This slightly differs from the parameters used in [5], where D = 768, H = 8, and B = 2. The results improve significantly with heavier backbones, among which seresnext26t_32x4d [8] seems the best.

2.2. Dealing with different lengths of the recordings

An important problem with the recordings is that they have different lengths. The shortest of them lasts only 0.5 seconds, while the longest is more than 1.5 hours. Different lengths don't allow these recordings to be batched during model training, which makes the training slow. There are several possible ways to overcome this difficulty. Let t0 be the fixed final length of each processed recording. If the initial recording is shorter, it is simply padded with zeros. If it is longer, the "First" approach is to use only the first t0-long interval of the recording. The "First and last" approach is to use the first t0/2 and the last t0/2 intervals stacked together. This seems reasonable because, as a rule, the recordings were processed by their authors before being uploaded to the website, so one can suspect that irrelevant sounds were cut off from the beginning and the end of each recording. These two approaches are quite popular among the competition solutions. However, a lot of information is lost. Instead, a third approach is proposed, called the "Sum" approach and illustrated in Fig. 2. It consists of the following steps.

1. Each recording of length t is padded with t_z = t0 · n_int − t zeros, where n_int = ⌈t/t0⌉, i.e., the resulting recording contains a whole number of intervals of length t0.
2. As an augmentation, a random circular shift is performed.
3. The recording is split into n_int intervals, and they are summed together. The length of the result is equal to t0.

Figure 2: The "Sum" approach to making the lengths of recordings equal.

There is no information loss in the third approach. The overlapping of bird calls that may occur doesn't seem to be a problem, since it corresponds to a situation when many birds vocalize at the same time. The same model (seresnext26t_32x4d; D = 768, H = 8, B = 2) was trained using all the described approaches with different t0 values. Every training was repeated three times with different random seeds, and the "best" run with the highest score on the public dataset was selected. The resulting public and private scores are presented in Table 2.

Table 2
The comparison of the different approaches.

        "First" approach      "First and last" approach    "Sum" approach
t0, s   Public    Private     Public    Private            Public    Private
5       0.606     0.584       –         –                  0.595     0.560
10      0.616     0.563       0.611     0.569              0.607     0.566
20      0.586     0.572       0.599     0.561              0.607     0.571
30      0.614     0.570       –         –                  0.620     0.582

The results produced by the different approaches are close to each other. However, the scores of the "First" approach are slightly better at low t0 values. The growth of t0 doesn't improve the scores of the "First" and "First and last" approaches, but the scores of the "Sum" approach increase, and it becomes preferable at large t0. At the same time, increasing t0 makes model training longer.

Table 3
The score of the model in different groups of bird species.

Group   N_rec                Number of bird species   Public score   Private score
1       0 < N_rec ≤ 20       34                       0.543          0.536
2       20 < N_rec ≤ 50      47                       0.667          0.610
3       50 < N_rec ≤ 100     27                       0.652          0.585
4       100 < N_rec ≤ 200    33                       0.672          0.603
5       200 < N_rec ≤ 500    41                       0.569          0.569

Figure 3: The locations where the recordings were made.

2.3. Train data selection

The train dataset provided in the BirdCLEF 2024 competition contains approximately half of all the recordings from the xeno-canto project [4] with primary labels corresponding to the 182 target birds. So, the obvious step is to download the absent data and create an additional dataset.
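The three steps of the "Sum" length-standardization approach from Section 2.2 can be sketched with NumPy. This is a hedged sketch: t0 is given in samples, and the function name is illustrative.

```python
import numpy as np


def sum_fold(audio, t0_samples, rng=None):
    """The "Sum" approach: pad to a whole number of t0-long intervals,
    apply a random circular shift, then sum the intervals together."""
    rng = rng if rng is not None else np.random.default_rng()
    n_int = -(-len(audio) // t0_samples)            # n_int = ceil(t / t0)
    pad = n_int * t0_samples - len(audio)           # t_z zeros to append
    audio = np.pad(audio, (0, pad))
    # Random circular shift (augmentation, step 2).
    audio = np.roll(audio, rng.integers(len(audio)))
    # Split into n_int intervals and sum them (step 3).
    return audio.reshape(n_int, t0_samples).sum(axis=0)
```

Because padding with zeros, a circular shift, and summation are all lossless in total energy, no part of the recording is discarded, unlike in the "First" and "First and last" approaches.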
Merged together, these two datasets contain about 40,000 recordings. Using the whole dataset, however, doesn't improve the model score. This seems strange, because higher data diversity usually increases the generalization ability of a model. So, one may suspect that the additional data somehow corrupts the dataset, making it less representative of the hidden dataset.

To find the reason for the described behavior, the AUC ROC probing technique is proposed. It is based on splitting the target data into several parts and masking the model predictions so that only one part of the data is scored each time. In the BirdCLEF competitions, the bird species can be grouped together. For example, let N = 182 be the total number of bird species. One can select a group of 0 < n < N bird species. During the submission, the predictions corresponding to the remaining N − n species are set to zero. A constant prediction produces an AUC ROC score of 0.5. So, the resulting AUC ROC score S is equal to

S = (S_n · n + 0.5 (N − n)) / N,

and the AUC ROC score S_n of the selected group is

S_n = (S · N − 0.5 (N − n)) / n.

Using this simple formula, it is possible to estimate the model performance for different groups of species. If these scores differ significantly and the size of each group is large enough, one may assume that the feature used for group selection is important. One possible feature that can affect the model's performance is the number N_rec of available recordings of each bird species, i.e., the frequency of its occurrence in the dataset. Following the proposed AUC ROC probing technique, the species were split into five groups (Table 3). One could assume that the model works well for birds with many training recordings and poorly otherwise. However, the situation is different: Group 5, with at least 200 recordings per bird species, has almost as low scores as Group 1 with at most 20 recordings.
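The AUC ROC probing formula can be turned into a small helper that recovers a group's score from the score of a masked submission (illustrative; it only inverts the averaging formula from Section 2.3):

```python
def group_score(s_masked, n_group, n_total=182):
    """Recover the AUC ROC score S_n of a species group from the score
    S of a submission where all other predictions are constant.

    S = (S_n * n + 0.5 * (N - n)) / N
    =>  S_n = (S * N - 0.5 * (N - n)) / n
    """
    return (s_masked * n_total - 0.5 * (n_total - n_group)) / n_group
```

For instance, with N = 182 and a group of n = 34 species, a masked-submission score S can be mapped back to the group's own AUC ROC, which is how the per-group values in Table 3 can be estimated.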
One can conclude that, for some reason, the model has significant difficulties with common birds. One reason is that common birds may be present in the background of many recordings without being marked, even with secondary labels. Another possible reason is the geographical distribution of the places where the recordings were made. Indeed, many common birds were recorded in Europe, America, or Africa, far away from the region of interest. Birds can have local dialects, and a bird that is common in Europe can be rare in India.

The geographic information can easily be taken into account, since GPS coordinates are provided. The locations where five bird species were recorded are presented in Fig. 3. Here, "zitcis1", "commoo3", and "barswa" are the primary labels of common birds, each with 500 recordings in the competition dataset. The fourth bird, "revbul", is medium-rare with 101 recordings, while the fifth, "maltro1", is a rare bird with 17 recordings. It is noticeable that almost all recordings of the common birds were made outside India, while "revbul" and "maltro1" are endemic birds. As a result, the number of recordings of common and rare species made in the Indian region is quite low and significantly less than that of medium-rare birds. This observation explains the differences in model scores across the groups of species. To handle it, the following algorithm for train data selection is proposed.

1. Prepare the whole dataset with all the recordings from the xeno-canto project [4] that contain the target bird species' calls.
2. For each recording, calculate its distance L_WG to the Western Ghats region. This can be done, for example, by placing a large number of points in the Western Ghats region and calculating the minimal distance between these points and the point where the recording was made.
3. Specify the maximum distance L_max and drop all the recordings with larger distances.
4. Calculate the number of recordings Ñ_rec for each bird species in the resulting dataset.
5. Calculate the distance weight w_L for each recording. This weight is a decreasing function of L_WG. For example, w_L = 1 + cos(π L_WG / L_max) may be used.
6. Calculate the class imbalance weight for each recording. This weight is a decreasing function of Ñ_rec. For example, w_imb = 1/Ñ_rec may be used.
7. The weight of each recording in the final dataset is the product of the distance and class imbalance weights: w = w_L · w_imb.

The value of L_max is important. In the current competition, setting L_max = 4000 km was a good choice. From a geographical point of view, it discards European, American, and most African data while covering Southern Asia. Using a lower L_max decreases the diversity of species, while a higher value adds irrelevant data to the training set. As a result, the public score of the model worsens in both cases.

2.4. Additional noise sources

As mentioned earlier, there is a huge domain shift between the competition data used for training and for model scoring. The training dataset contains recordings from the xeno-canto project [4], which are, as a rule, of high quality. However, the model is supposed to work well with data recorded with an omnidirectional microphone in a noisy environment. The unlabeled soundscapes dataset is provided with recordings similar to those used for model scoring. Listening to the unlabeled recordings, it is possible to make a list of the noise sources present. They include car traffic and horns, aircraft noise, sirens, human voices, music, frogs, and cicadas, as well as broadband noises of rain, wind, or even uncertain nature. An example spectrogram of such an unlabeled soundscape is presented in Fig. 4.
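The selection and weighting algorithm of Section 2.3 can be sketched as follows. The input format and function name are assumptions; computing L_WG itself (step 2) is taken as given.

```python
import math
from collections import Counter


def recording_weights(records, l_max_km=4000.0):
    """Steps 3-7 of the train data selection algorithm.

    `records` is a list of (species, distance_km) pairs, where the
    distance is the precomputed L_WG to the Western Ghats region."""
    # Step 3: drop recordings farther than L_max from the region.
    kept = [(sp, d) for sp, d in records if d <= l_max_km]
    # Step 4: per-species recording counts in the reduced dataset.
    n_rec = Counter(sp for sp, _ in kept)
    weights = []
    for sp, d in kept:
        w_dist = 1 + math.cos(math.pi * d / l_max_km)  # step 5
        w_imb = 1 / n_rec[sp]                          # step 6
        weights.append(w_dist * w_imb)                 # step 7
    return kept, weights
```

The cosine form makes a recording's weight fall smoothly from 2·w_imb at the region itself to zero at L_max, so distant recordings contribute little even before being dropped.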
Figure 4: The spectrogram of the unlabeled soundscape 101125218.ogg with various sound sources.

One has to introduce all these kinds of interference to the model to increase its generalization ability and reduce the domain shift. At the same time, the extra sounds added to a recording must not contain bird calls that could disorient the model. There are several datasets that can help introduce these noises; for example, the Vehicle Type Sound Dataset [9], the Noise Audio Data Dataset with short sounds of different natures [10], the Rain Forest Dataset [11] with recordings of several frog species, and the Hindi Speech Classification Dataset with recordings of short phrases [12]. In addition, a manual selection of recordings not containing bird calls can be used [13]. Although using a dataset with the regional dialects spoken in the Western Ghats alongside Hindi may seem more appropriate, it is quite hard to find a sufficient number of such recordings distributed freely. At the same time, the influence of including these dialects on model performance seems minor.

The broadband noises are quite hard to add. On the one hand, many existing recordings of rain and wind noise contain bird calls, which have to be filtered out manually. On the other hand, these noises are nonstationary, so they can't be precisely modeled with any kind of simple stationary noise. A similar situation holds for the sounds produced by cicadas. To deal with it, it is proposed to use the unlabeled soundscapes. These recordings, however, contain bird calls that must be excluded. The idea is based on the fact that the background noise, as well as the cicadas, form patterns on the spectrogram that change slowly over time, while the patterns of bird calls and other noises are irregular. The first step of the algorithm is taking the 1D Fourier transform of the spectrogram along the time axis.
In the second step, only the components with the largest absolute values are kept, while the others are set to zero. In the third step, the inverse 1D Fourier transform is performed, and the result is multiplied by random noise. The described filtering procedure significantly reduces the amount of information the spectrogram contains, and its "thin structure", including the bird calls, disappears. An example is presented in Fig. 5.

Figure 5: The spectrogram of the unlabeled soundscape 1000308629.ogg (a) before and (b) after the filtering procedure.

2.5. Standard augmentations and post-processing

The "standard" augmentations can also be used in the BirdCLEF competition. They are performed on spectrograms and include XY masking, random grid shuffle, and recording mixing. XY masking selects a few rectangular areas in the spectrogram and sets the data inside them to a constant. Random grid shuffle splits the spectrogram into a grid and shuffles its cells. This transform can be used only along the time axis, and the size of each cell must be greater than the length of a potential bird call, say, 5 seconds. Recording mixing is the technique of adding two or more recordings together before passing them to the model. In this case, the resulting recording contains all the birds from the initial recordings, and its weight is the sum of the initial weights. These augmentations make the training dataset more diverse.

The model predictions may be post-processed using sliding window averaging. This approach assumes that if there is a bird call in a certain time interval, the probability of the same bird call in the neighboring intervals is also high. So, the final prediction for the current interval is a sum of the predictions for the current, previous, and next intervals with weights of 1 − 2α, α, and α, respectively. The coefficient α is an averaging parameter, which is often set to 0.25 in BirdCLEF competitions. The results of using different α values are presented in Table 4.

Table 4
The results of applying sliding window averaging.

α       Public score   Private score
0.000   0.681          0.616
0.100   0.687          0.619
0.150   0.690          0.620
0.200   0.692          0.621
0.225   0.693          0.621
0.250   0.693          0.621
0.275   0.694          0.621
0.300   0.694          0.621
0.350   0.693          0.621
0.400   0.692          0.620

It can be seen that the commonly used value α = 0.25 is indeed close to optimal; however, the public score is maximized at α = 0.3.

2.6. Inference time optimization

The inference time on the CPU is one of the crucial factors in the competition. However, it was noticed that the models became extremely slow after training. For example, the model used (seresnext26t_32x4d; D = 768, H = 8, B = 2) processed a 240-second recording in 60–70 seconds after training, while the untrained model did the same job in 3.5 seconds, even though the training procedure doesn't change the model architecture and only adjusts its weights. After some research, the problem was localized. In the model's computational graph, some of the paths are unnecessary, and the corresponding weights should be driven to zero during training. However, L2 regularization makes these weights very small but not exactly zero. As a result, not only do these paths consume computational resources, but the computations with such tiny (subnormal) values are extremely slow. To prevent this behavior, one can retrain the model with an additional L1 regularization term, which causes the small weights to become exactly zero. A simple solution for an already-trained model is weight rounding: the model precision is converted from float32 to float16 and then back to float32, flushing the tiny weights to zero. Weight rounding can be performed with one line of code in PyTorch: model.half().float().

Table 5
The inference time of different frameworks per 240-second recording.

Framework            Inference time
PyTorch              3.3 sec
ONNX                 2.8 sec
OpenVino             2.5 sec
OpenVino with HT     2.0 sec
OpenVino with TPE    2.0 sec

Table 6
The results of training the same model with different random seeds.
Seed      Public score    Private score
1         0.662           0.616
2         0.645           0.615
3         0.634           0.620
4         0.669           0.619
Average   0.653 ± 0.017   0.618 ± 0.002

Table 7
The results of applying different methods in the BirdCLEF 2024 competition (columns correspond to Models 1–7).

Method                                      1      2      3      4      5      6      7
Competition data                            +             +      +      +      +      +
Additional data                                    +      +      +      +      +      +
Select the recordings with L_WG < 4000 km                 +      +      +      +
Additional noise sources                                         +      +      +
Distance and class imbalance weights                                    +      +
Sliding window averaging                                                       +
Public score                                0.620  0.560  0.622  0.663  0.672  0.681  0.684
Private score                               0.582  0.509  0.568  0.633  0.622  0.616  0.637

Further acceleration is possible with the help of powerful frameworks such as ONNX and OpenVino. The model was converted to the ONNX format and then exported to OpenVino. The inference time summary is presented in Table 5. OpenVino is faster than ONNX, but one may notice that it doesn't use all available CPU cores when running. To do so, hyperthreading (HT) must be switched on. An alternative is to run multiple threads with multiprocessing; in Python, this can be done in several ways, for example, by using the ThreadPoolExecutor class (TPE). As expected, the results are the same as with HT. It should be noted that many laptops now have multiple cores and support HT, so using it may accelerate the model outside of the competition environment as well.

3. Results

The main difficulty of the BirdCLEF 2024 competition is the unstable public score. Minor changes in the model and its training procedure may significantly increase or decrease this score, which makes it quite hard to test different approaches. For example, the results of training exactly the same model on the same data with different random seeds are presented in Table 6. On the one hand, the standard deviation of the public score may be several times larger than the improvement brought by some clever technique. On the other hand, this makes it hard to introduce a reliable CV.
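For completeness, the sliding window averaging described in Section 2.5 can be sketched with NumPy. The treatment of the first and last intervals is an assumption, as the paper does not specify it:

```python
import numpy as np


def sliding_average(preds, alpha=0.25):
    """Mix each interval's prediction with its neighbors using weights
    (alpha, 1 - 2*alpha, alpha). Edge intervals reuse their own value
    for the missing neighbor (one possible boundary convention)."""
    preds = np.asarray(preds, dtype=float)
    prev_ = np.concatenate([preds[:1], preds[:-1]])  # previous interval
    next_ = np.concatenate([preds[1:], preds[-1:]])  # next interval
    return (1 - 2 * alpha) * preds + alpha * (prev_ + next_)
```

The same function works for a 1D score sequence or a (time, classes) array, since the concatenation runs along the time axis.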
A CV score can't be consistent with the public score because the latter is unstable, nor with the private score because there is a huge domain shift between the datasets. So, the reasonable way is to conduct many experiments, take the average of the public score, and hope that this will not cause overfitting to the public score.

The methods described in Section 2 were applied consecutively, and the results are presented in Table 7. The most significant improvement was caused by the use of geographical data. The inference time of the resulting models was 28 minutes, so an ensemble of four models with the best public scores was made. The public score of this ensemble was 0.713 (13th place on the competition public leaderboard). However, the selected models overfit, and the private score of the ensemble was as low as 0.616. Despite the unlucky model selection, the presented methods seem effective and may be successfully used in future competitions and applications.

References

[1] S. Kahl, T. Denton, H. Klinck, V. Ramesh, V. Joshi, M. Srivathsa, A. Anand, C. Arvind, H. CP, S. Sawant, V. V. Robin, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats, Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum (2024).
[2] BirdCLEF 2024 – birdcall species identification from audio, 2024. URL: https://www.kaggle.com/competitions/birdclef-2024.
[3] A. Joly, L. Picek, S. Kahl, H. Goëau, V. Espitalier, C. Botella, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, M. Šulc, M. Hrúz, M. Servajean, et al., Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024.
[4] Xeno-canto: sharing wildlife sounds from around the world, 2024. URL: https://xeno-canto.org.
[5] M. V. Shugaev, N. Tanahashi, P. Dhingra, U.
Patel, BirdCLEF 2021: Building a birdcall segmentation model based on weak labels, CEUR Workshop Proceedings 2936 (2021). URL: https://ceur-ws.org/Vol-2936/paper-141.pdf.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
[7] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/1512.03385 (2015). URL: http://arxiv.org/abs/1512.03385. arXiv:1512.03385.
[8] J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-excitation networks, 2019. URL: http://arxiv.org/abs/1709.01507. arXiv:1709.01507.
[9] Vehicle Type Sound Dataset, 2024. URL: https://www.kaggle.com/datasets/brinkor/vehicle-type-sound-dataset.
[10] Noise Audio Data Dataset, 2024. URL: https://www.kaggle.com/datasets/javohirtoshqorgonov/noise-audio-data.
[11] Rainforest Connection species audio detection data, 2021. URL: https://www.kaggle.com/competitions/rfcx-species-audio-detection/data.
[12] Hindi Speech Classification Dataset, 2024. URL: https://www.kaggle.com/datasets/vivmankar/hindi-speech-classification.
[13] Nocall manual classification dataset, 2024. URL: https://www.kaggle.com/datasets/janmpia/nocall-manual-classification.