=Paper=
{{Paper
|id=Vol-2936/paper-137
|storemode=property
|title=STFT Transformers for Bird Song Recognition
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-137.pdf
|volume=Vol-2936
|authors=Jean-Francois Puget
|dblpUrl=https://dblp.org/rec/conf/clef/Puget21
}}
==STFT Transformers for Bird Song Recognition==
Jean-Francois Puget, NVIDIA, 400 avenue Roumanille, Biot, 06410, France

Abstract

State of the art audio classification often relies on convolution neural networks applied to images made of (log mel) spectrograms. Recently, new image classification models, namely Vision Transformers (ViT), have been proposed. ViT works by mapping non overlapping image patches to the input of a vanilla transformer model. While radically different from convolution neural networks, its performance is almost as good. Inspired by this, we explored various ways to adapt ViT to spectrogram classification. The most effective way we found was to use time slices of spectrograms as the input patches. We call this model an STFT Transformer. The model was evaluated and used during the BirdCLEF 2021 competition. On its own it would have reached the 15th rank out of 816 competitors. When blended with some convolution networks it led to an 11th rank and a gold medal in that competition.

Keywords: Vision Transformer, Audio classification, Bird song recognition, CEUR-WS

1. Introduction

Bird song recognition is key for monitoring the evolution of the environment and detecting possible degradations. The use of machine learning to predict which species are present in audio recordings is a key enabler. A lot of progress has been made by leveraging the extraordinary advances of computer vision models over the last decade. This progress is visible through BirdCLEF, a yearly competition series focusing on bird song recognition from audio recordings. Models that win one of these competitions become baselines for the next one, and progress is steady. The 2021 edition [1][2], hosted on Kaggle, was no exception. Top solutions from the previous year's competition were used as baselines, e.g. [3][4], and participants had to find novel techniques to beat them. We present one such novel technique, called STFT Transformer. It was inspired by recently introduced image classification models called Vision Transformers [5][6].

Apart from its novel architecture, the STFT Transformer pipeline is in line with what is now standard practice. Audio recordings are transformed into two-dimensional images via the application of a short-term Fourier transform (STFT) on overlapping windows of the signal. The resulting image is called a spectrogram and has two dimensions: a frequency axis and a time axis. The spectrogram can optionally be further transformed by log and mel transforms to get a non linear frequency axis. Figure 1 shows a 5 second waveform and its log mel spectrogram.

Figure 1: A waveform (top) and its log mel spectrogram (bottom)
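For illustration, here is a minimal Python sketch of how such a log mel spectrogram can be computed with librosa. The file name, sampling rate, and parameter values below are illustrative placeholders, not the exact settings of our pipeline (those are given in section 4.1).

<syntaxhighlight lang="python">
import librosa

# Load a 5 second mono clip (file name and 32 kHz sampling rate are placeholders).
wave, sr = librosa.load("example_clip.ogg", sr=32000, duration=5.0)

# Mel spectrogram computed from overlapping STFT windows; values are illustrative.
mel = librosa.feature.melspectrogram(
    y=wave, sr=sr, n_fft=2048, hop_length=278, n_mels=250
)

# Log compression turns the mel spectrogram into a log mel spectrogram.
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (n_mels, number of time frames)
</syntaxhighlight>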
Once we have 2D images we can use all the machinery developed for image classification, including state of the art convolution neural networks (CNNs) like EfficientNet. We can also use image augmentation techniques to make models more robust. Using CNNs on spectrograms must be done carefully though, because spectrograms are not ordinary images. The two axes of a spectrogram (frequency and time) are not interchangeable, and some commonly used image augmentation techniques, such as image rotation, make no sense with spectrograms.

A further issue is that CNNs are good at learning translation invariant patterns. This is desirable along the time axis but detrimental along the frequency axis, where a translation corresponds to a pitch change. We do not want models that ignore pitch. STFT Transformer solves this problem entirely.

The contribution of this paper is fourfold:
• A new STFT Transformer model working directly on short-term Fourier transform sequences
• A way to reuse pretrained Vision Transformer weights for STFT Transformer
• A robust training pipeline for STFT Transformer fine tuning
• Competitive results in the 2021 BirdCLEF competition

The paper is organized as follows. We describe STFT Transformer in section 2 and the 2021 BirdCLEF competition in section 3. We discuss how the model was trained on BirdCLEF data in section 4. We report competition results for STFT Transformer versus CNNs in section 5. We conclude and discuss possible improvements in section 6.

2. STFT Transformer

Applying transformers to audio spectrograms is not straightforward. One of the issues is how to train them with audio data: like any transformer model, training from scratch would require a lot of data. For comparison, ViT [5] was trained on a dataset of 300M images. The BirdCLEF competition training data, with only about 64k recordings, is way too small for reliable training from scratch. One way to cope with small data is to use a pretrained model and fine tune it on the competition data. Unfortunately, there are no pretrained transformers for spectrograms. We therefore decided to reuse the weights of Vision Transformers pretrained on large image datasets. For this we needed to use spectrograms of the size expected by Vision Transformers.

ViT expects an image of size 384x384 pixels. It extracts non overlapping 16x16 patches arranged in a 24x24 grid. Our first try was to create 384x384 spectrograms. Results were very disappointing, mostly because the time axis was too short, i.e. the frame rate (384/5 ≈ 77 Hz) was too small. We then realized that we could use a different grid shape while keeping the number of patches unchanged. We tried 12x48 patches (192x768 spectrogram, 154 Hz frame rate) and 16x36 patches (256x576 spectrogram, 115 Hz frame rate). Results were better because the frame rate was higher, but still way below CNNs.

One reason for the poorer performance is that ViT adds learnable position embeddings to its input patches. These embeddings capture the spatial relationship between patch locations, so when we change the grid the pretrained position embeddings are no longer correct. We implemented an interpolation from the 24x24 grid embeddings to the new grid size, either 12x48 or 16x36. This interpolation was applied after the pretrained weights were loaded into the model and before training, so that the position embeddings captured the new spatial layout accurately. This improved the model accuracy a bit but, most importantly, it made training convergence easier.
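Below is a minimal sketch of such an interpolation, assuming a timm-style ViT whose position embedding tensor has shape (1, 1 + 24*24, dim) with a single leading class token; the helper name and the bilinear interpolation mode are our illustrative choices, not necessarily the exact code we used.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=(24, 24), new_grid=(16, 36)):
    """Resample ViT position embeddings from old_grid to new_grid.

    pos_embed: tensor of shape (1, 1 + H*W, dim), class token first.
    """
    cls_token, grid_embed = pos_embed[:, :1], pos_embed[:, 1:]
    dim = grid_embed.shape[-1]
    # (1, H*W, dim) -> (1, dim, H, W) so that 2D interpolation can be used.
    grid_embed = grid_embed.reshape(1, *old_grid, dim).permute(0, 3, 1, 2)
    grid_embed = F.interpolate(grid_embed, size=new_grid,
                               mode="bilinear", align_corners=False)
    # Back to (1, H'*W', dim) and re-attach the class token embedding.
    grid_embed = grid_embed.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([cls_token, grid_embed], dim=1)

# Example: a ViT-Base/16 model at 384x384 resolution has a 24x24 grid, dim 768.
pos_embed = torch.randn(1, 1 + 24 * 24, 768)
new_pos_embed = interpolate_pos_embed(pos_embed, (24, 24), (16, 36))
print(new_pos_embed.shape)  # torch.Size([1, 577, 768])
</syntaxhighlight>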
After revisiting the grid shape, we revisited the patch shape. If we use a spectrogram of size 256x576 pixels, i.e. a frame rate of 115 Hz, then one time slice of the spectrogram (one STFT frame) contains 256 pixels. And 256 pixels is precisely the number of pixels in a 16x16 patch. Therefore, we decided to feed a 1x576 grid of 256x1 patches to ViT. The main consequence of this choice is that each patch spans the entire frequency axis. There is no longer any need to worry about the frequency position of patches, and there cannot be any pitch invariance in the model anymore. Still, ViT expects a 384x384 image. We therefore had to reshape the 256x576 spectrogram into a 384x384 image, i.e. a 24x24 grid of 16x16 patches where each patch holds one 256x1 time slice. Figure 2 shows how the spectrogram from Figure 1 looks after this reshaping.

Figure 2: Example of an input to STFT Transformer
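The following numpy sketch illustrates one way to perform this packing, under the assumption that ViT reads its 16x16 patches in row-major order over the 24x24 grid and that each 256x1 time slice is laid out row-major inside its patch; the function name is ours and the exact layout used in our code may differ.

<syntaxhighlight lang="python">
import numpy as np

def spectrogram_to_vit_input(spec):
    """Pack a 256x576 spectrogram into a 384x384 image so that the k-th
    16x16 ViT patch (row-major over the 24x24 grid) holds the k-th time slice."""
    assert spec.shape == (256, 576)
    # Each column (one STFT time slice, 256 values) becomes one 16x16 patch.
    patches = spec.T.reshape(24, 24, 16, 16)        # (grid row, grid col, 16, 16)
    # Interleave grid and patch axes to rebuild a plain 384x384 image.
    return patches.transpose(0, 2, 1, 3).reshape(384, 384)

spec = np.random.rand(256, 576).astype(np.float32)
image = spectrogram_to_vit_input(spec)

# Sanity check: patch number 30 of the image equals time slice 30 of the spectrogram.
row, col = divmod(30, 24)
patch = image[16 * row:16 * (row + 1), 16 * col:16 * (col + 1)]
assert np.allclose(patch.reshape(-1), spec[:, 30])
</syntaxhighlight>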
We tried various ways to adapt the existing position embeddings to this one-dimensional grid, but the best option was simply to leave them as is. During fine tuning the model adjusted them to the one-dimensional sequence nature of the input. We call the resulting model an STFT Transformer, given that it takes an STFT sequence as input. This model was evaluated and used during the BirdCLEF competition, see below.

3. BirdCLEF 2021 competition

The competition was hosted on Kaggle [6], the largest machine learning competition platform worldwide, now owned by Google. Training data for this year's competition came in two flavors:
• 63874 recordings from Xeno Canto [7]. Each recording comes with a primary label, i.e. the name of a bird species heard in the recording. It also has secondary labels, which are other species heard in the recording, possibly less often and not as loud as the primary label. The duration of these recordings ranges from 5 seconds to minutes. Location, time of recording, and many other metadata are available for each recording.
• 20 soundscapes of 10 minutes each. Each 5 second clip of a train soundscape comes with a list of the bird species heard during those 5 seconds, or nocall if there is none.

Test data is hidden from contestants. It is made of about 80 soundscapes similar to the train soundscapes. The labels for the test data are also hidden. There were 397 bird species present in the training data, and the competition hosts stated that all species present in the test data also appear in the training data. Test data was split between a public part and a private part that were both kept hidden.

Contestants had to provide a script or a Jupyter notebook on Kaggle [6] that reads the test data and outputs a file containing predictions for each 5 second clip of each test soundscape, without being able to access any internet resource. Hiding the test data is a great way to test the generalization of machine learning models to new data, and disabling internet access makes sure no one can leak the test data. Contestants could train machine learning models as they saw fit. They usually created a Kaggle dataset containing their trained model weights and attached it to the test notebook. Code in the notebook could then load these weights from the attached dataset and use them. This way it was possible to use pretrained models on hidden test data.

Once contestant code was submitted it was run on the hidden test data and two scores were computed, but only the score obtained on the public test part was reported to the contestant. Each contestant's best public score was displayed on a public leaderboard, so it was possible to see the performance of all contestants in real time. This public leaderboard is a very effective gamification tool. It is very addictive, and most contestants checked it at least daily during the competition. At the end of the competition the scores computed on the private test part were disclosed in a private leaderboard. Ranking on the private leaderboard is the official result of the competition. The top 11 ranks were awarded a gold medal, and the top 3 ranks also won prize money.

4. Training models

Even with pretrained weights, fine tuning transformers is tricky. Pipelines designed for training convolution networks fail miserably, with exploding loss functions or disappointing model performance. We devised a training pipeline that enabled robust training of good STFT Transformers.

4.1. Data preprocessing

We trained the model on the Xeno Canto recordings. It was clear that training on random 5 second clips had a drawback: some of the clips may not contain the target bird. We therefore relied on a simple hypothesis: clips were stripped, i.e. periods without a song at the beginning or at the end were removed for the sake of reducing storage needs. We therefore trained on the first 5 seconds or the last 5 seconds of clips, assuming these would contain the primary label bird. We preprocessed all the Xeno Canto recordings and saved their first 8 seconds and their last 8 seconds as Numpy files. This sped up data loading significantly given that we did not need to load full recordings.

In order to improve model confidence when predicting no call, we added to our training data the nocall clips from the freefield1010 dataset [8]. We added a 398th class to represent nocall. As a result our model could predict one or more birds, or a nocall.

Train soundscapes can contain several bird species in a 5 second clip. To model this from the single-bird clips of Xeno Canto, we implemented a mixup variant where up to 3 training clips could be merged. To model the presence of several bird species, we define the target for the merged clip as the maximum of the targets of the merged clips. The nocall clips from freefield1010 added background noise when they were used in mixup.

We used librosa [9] to compute spectrograms, with an n_fft of 2048, 250 mel bins, and the default window length. We computed the hop length as a function of the desired spectrogram length: a shorter hop length yields a longer spectrogram. The number of mel bins defines the height of the spectrogram, which was then resized to the desired height. We contemplated moving spectrogram computation to the GPU, but we saw that the GPU on Kaggle was already fully busy with model prediction.

We implemented pitch and speed variations by modifying the sampling rate and the hop length passed to the STFT computation routine from librosa. A sampling rate higher than the actual sampling rate moves the pitch upward. A shorter hop length amounts to using a slowed down recording. Clip length was adjusted to yield the same spectrogram length as when no data transformation was used. This is why we stored clips longer than 5 seconds: it enables using long enough clips when the hop length is increased. We also added gaussian noise, as used in the previous year's winning solution [3]. When predicting on soundscapes (train or test) we disabled all data augmentations. Soundscapes were fed to the model 5 seconds at a time.

4.2. Loss function and secondary labels

Contestant submissions were evaluated by comparing the model predictions to the test ground truth. For each 5 second clip an F1 score between the predictions and the ground truth is computed, and these F1 scores are averaged. The F1 score cannot be used directly as a loss function for training models because it is not differentiable. We simply used binary cross entropy for a 398-class multilabel problem. For each training clip we defined a 398-dimensional target, with a 1 for the primary label or for nocall. Primary labels were noisy, because there is no guarantee that the primary label bird actually appears in the first 5 seconds and in the last 5 seconds of every Xeno Canto recording. But they must appear often enough, given the good results of our models. We hypothesize that the small proportion of cases where they do not appear in these clips acts as a regularization, a bit like label smoothing.

Secondary labels were even noisier than primary labels, and there the clipping argument does not apply: secondary labels could occur anywhere in the training clips. To cope with this we masked the loss for secondary labels, as we did not want to force the model to learn a presence or an absence when we do not know. We therefore defined a secondary mask that nullifies the BCE loss for secondary labels. For instance, assuming only 3 bird species b0, b1, and b2, and a clip with primary label b0 and secondary label b1, then these two target values are possible:
[1, 0, 0]
[1, 1, 0]
The secondary mask is therefore:
[1, 0, 1]
The loss for b1 is masked and ignored. For merged clips in mixup, a target is masked if it is not one of the primary labels and is one of the secondary labels.
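A minimal PyTorch sketch of such a masked loss is shown below; the function name and the normalization by the mask sum are illustrative choices, not necessarily the exact implementation we used.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def masked_bce_loss(logits, targets, secondary_mask):
    """BCE over the 398 classes, ignoring positions of secondary labels.

    logits, targets, secondary_mask: tensors of shape (batch, 398);
    secondary_mask is 0 where a class is a secondary label, 1 elsewhere.
    """
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    loss = loss * secondary_mask
    return loss.sum() / secondary_mask.sum().clamp(min=1)

# Toy example with only 3 species: primary label b0, secondary label b1.
logits = torch.tensor([[2.0, -1.0, -3.0]])
target = torch.tensor([[1.0, 0.0, 0.0]])
mask = torch.tensor([[1.0, 0.0, 1.0]])   # the loss on b1 is ignored
print(masked_bce_loss(logits, target, mask))
</syntaxhighlight>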
4.3. Cross validation

We used two ways to evaluate model strength. The first one, used by many contestants, is to use the train soundscapes as a holdout dataset. Models are trained on Xeno Canto clips, then used to make predictions on every 5 second clip of every train soundscape. The F1 score is then computed as for the test data. The test data contains 54% nocall clips, as witnessed by the score of an all-nocall submission. In order to adjust for the difference in the proportion of nocall clips between train soundscapes and test soundscapes, we scaled the computation of the validation score: we compute the F1 score on nocall rows separately from the F1 score on bird call rows, then compute the final score as

score_all = 0.54 * score_nocall + (1 - 0.54) * score_birds    (1)

Our validation score was very close to the submission score, except for some models from time to time. In order to detect such cases we also used 5-fold cross validation on Xeno Canto clips, adjusting the score with equation (1) as well. We then defined our local score as the minimum of the train soundscape score and the Xeno Canto cross validation score. This revised local score was very well correlated with the leaderboard score.

4.4. Training transformers

We started training our modified vision transformers using the same pipeline as our CNNs, namely 60 epochs with the Adam optimizer with weight decay and a one-cycle learning rate scheduler. Given the class imbalance we used a pos_weight of 10 in the BCE loss as well as a class weight proportional to the inverse of each bird species frequency. This worked very well in past competitions for CNNs, and it worked very well here too: our CNN stayed at the top of the public leaderboard throughout the competition until the last week. However, it was impossible to train transformers with this pipeline: either the loss exploded, or the models did not train much. We hypothesized that this was because the loss landscape is not smooth at all for vision transformers, with very narrow local minima. This hunch has since been confirmed in a paper published after the competition deadline [10].

To make the transformer learn, we hypothesized that keeping gradients as small as possible would enable reaching these very narrow loss minima. We tried gradient clipping, but it was too crude. We finally ended up using a combination of AMP autoscaling, no class weights, and a small learning rate of 1e-4. In order to cope with the class imbalance, we instead sampled clips according to a weight that depended on their primary label frequency: the less frequent the primary label, the more often the clip would be selected.
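A minimal sketch of this kind of sampling with PyTorch's WeightedRandomSampler is shown below; the inverse-frequency weights are an assumption about the exact weighting function, which is not spelled out above, and the toy dataset is purely illustrative.

<syntaxhighlight lang="python">
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy setup: one primary label (class index) per training clip.
primary_labels = [0, 0, 0, 0, 1, 2, 2]
features = torch.randn(len(primary_labels), 8)
dataset = TensorDataset(features, torch.tensor(primary_labels))

# Rarer primary labels get larger sampling weights (inverse frequency here).
counts = Counter(primary_labels)
weights = [1.0 / counts[label] for label in primary_labels]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

loader = DataLoader(dataset, batch_size=4, sampler=sampler)
for batch_features, batch_labels in loader:
    print(batch_labels)  # rare species now appear about as often as common ones
</syntaxhighlight>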
We tried several variants of ViT and only kept the original ViT [5] and its DeiT variant [11], mostly because they were the ones supporting 16x16 patches on images larger than 224x224. We used the timm [12] implementation of vision transformers and of the other models. Models were trained on an NVIDIA DGX Station with V100 GPUs.

5. Results

Our models were trained on 5 second clips. They may therefore miss longer time span patterns as well as species co-occurrences. We could have addressed this by using a sound event detection model [13] with our models as backbone. Instead, we decided to postprocess our soundscape prediction logits using 3 thresholds tuned on the train soundscapes, pres_thr, incr_thr and occur_incr, as sketched below:
1. A key observation was made in [3] and [4]: if a bird is heard in one of the 5 second clips of a soundscape, then it is more likely to be heard in others. To capture this correlation, we computed the maximum prediction for each species across all 5 second clips. For species where this maximum is greater than pres_thr, we predicted the species as present as soon as a 5 second clip prediction is above incr_thr.
2. If two bird species are known to coexist, then hearing one makes the other one more likely. For this we computed a co-occurrence matrix using primary and secondary labels. If a species was detected, then the logits of all co-occurring species were increased by occur_incr.
Typical values for pres_thr, incr_thr and occur_incr are 2.1, -0.2, and 0.9 respectively.
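The sketch below illustrates this postprocessing on a matrix of per-clip logits; the ordering of the two steps and the way the co-occurrence boost is applied are our reading of the description above, not necessarily the exact code used, and the function name is illustrative.

<syntaxhighlight lang="python">
import numpy as np

def postprocess_soundscape(logits, cooc, pres_thr=2.1, incr_thr=-0.2, occur_incr=0.9):
    """Turn per-clip logits (num_clips, num_species) into binary predictions.

    cooc is a boolean (num_species, num_species) co-occurrence matrix built
    from primary and secondary labels.
    """
    logits = logits.copy()
    # Step 2: species detected somewhere in the soundscape boost co-occurring ones.
    detected = logits.max(axis=0) > pres_thr
    boost = occur_incr * (cooc[detected].any(axis=0) & ~detected)
    logits = logits + boost
    # Step 1: a species present anywhere (max > pres_thr) is predicted in every
    # clip where its logit exceeds the lower threshold incr_thr.
    present_somewhere = logits.max(axis=0) > pres_thr
    return (logits > incr_thr) & present_somewhere

# Toy example: 3 clips, 2 species that always co-occur.
logits = np.array([[2.5, -0.1], [0.0, -0.5], [-1.0, 1.0]])
cooc = np.array([[False, True], [True, False]])
print(postprocess_soundscape(logits, cooc))
</syntaxhighlight>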
We report results for 5 models in Table 1: STFT Transformer, two Vision Transformers using 16x16 patches (ViT and DeiT), and two variants of our best CNN. The two CNNs use the same EfficientNet b3 backbone but with different input sizes. For each model we report the validation score on training data, the public leaderboard score, and the private leaderboard score.

Table 1: Model performance
{| class="wikitable"
! Backbone !! Image size !! Val score !! Public LB !! Private LB
|-
| EfficientNet b3 || 300x600 || 0.7855 || 0.7906 || 0.6554
|-
| EfficientNet b3 || 200x600 || 0.7853 || 0.7965 || 0.6630
|-
| 16x16 ViT || 192x768 || 0.7645 || 0.7445 || 0.6317
|-
| 16x16 DeiT || 192x768 || 0.7631 || 0.7580 || 0.6397
|-
| STFT DeiT || 256x576 || 0.7755 || 0.7569 || 0.6667
|}

We see that DeiT outperformed ViT when applied to spectrograms with 16x16 patches. Therefore, we did not submit a ViT version of STFT Transformer and only used the DeiT version. STFT Transformer looked weaker than the CNNs on the public leaderboard, but it performed slightly better on the private leaderboard. We also see that STFT Transformer is significantly better than DeiT using 16x16 patches. We believe this is because the former removes the pitch translation issue that the latter faces. The STFT Transformer private score ranks 15th on the private leaderboard. Given that most top teams used an ensemble of models, we think STFT Transformer alone is among the top 10 models of the competition. When blended with the two CNNs in the table it led to the 11th rank and a gold medal. It is a bit surprising that our models, trained on short clips, were competitive with the sound event detection (SED) models [13] trained on 30 second clips used by most top teams in the competition.

6. Conclusion and future work

State of the art models for audio classification often rely on convolution networks applied to spectrograms. We presented a new model for audio clip classification, called STFT Transformer. This model is inspired by the recently introduced Vision Transformers. The main difference is that our model takes as input the sequence of short-term Fourier transforms of the audio signal, instead of 16x16 patches of spectrograms. We described how to reuse the weights of Vision Transformers pretrained on large image datasets. We also described how the model was trained and used in the 2021 BirdCLEF competition. The model is competitive with top convolution networks. We claim this is because STFT Transformer does not learn pitch invariant patterns.

Our model can certainly be improved in many ways. The most obvious is to train it on longer clips, for instance by using it as a backbone in a SED model [13]. We would also like to explore pretraining on a larger audio dataset instead of reusing models pretrained on image datasets. Last, but not least, we would like to use the new SAM optimizer [10] to train our STFT Transformers. Our code is available at https://github.com/jfpuget/STFT_Transformer

7. Acknowledgement

This work was fully funded by NVIDIA. Models were trained on an NVIDIA DGX Station with V100 GPUs. The author is part of an NVIDIA team whose mission is to compete on Kaggle and other platforms to demonstrate the effectiveness of the NVIDIA hardware and software stack. We would like to thank the organizers and Kaggle for hosting a flawless competition. Data was clean, the metric was well defined, and the hosts provided great introductory notebooks.

8. References

[1] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, R. Ruiz De Castañeda, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Dorso, H. Klinck, T. Denton, I. Eggel, P. Bonnet, H. Müller, Overview of LifeCLEF 2021: a System-oriented Evaluation of Automated Species Identification and Species Distribution Prediction, in: Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021), 2021.
[2] S. Kahl, T. Denton, H. Klinck, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2021: Bird call identification in soundscape recordings, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021.
[3] R. Wong, 1st place solution, 2020. URL: https://www.kaggle.com/c/birdsong-recognition/discussion/183208
[4] V. Kramarenko, 2nd place solution, 2020. URL: https://www.kaggle.com/c/birdsong-recognition/discussion/183269
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv:2010.11929, 2020.
[6] Kaggle. URL: https://www.kaggle.com
[7] Xeno Canto. Sharing bird sounds from around the world. URL: https://www.xeno-canto.org/
[8] D. Stowell, freefield1010, 2013. URL: https://archive.org/details/freefield1010
[9] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, et al., librosa: Audio and music signal analysis in Python, in: Proceedings of the 14th Python in Science Conference, 2015.
[10] X. Chen, C.-J. Hsieh, B. Gong, When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, arXiv:2106.01548, 2021.
[11] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers & distillation through attention, arXiv:2012.12877, 2020.
[12] R. Wightman, timm, 2021. URL: https://github.com/rwightman/pytorch-image-models
[13] Q. Kong, Y. Cao, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition, arXiv:1912.10211, 2020.