<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classic Approaches to Bird Song Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihai-Dimitrie Minuț</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristian Simionescu</string-name>
          <email>cristian@nexusmedia.ro</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Iftene</string-name>
          <email>adiftene@info.uaic.ro</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University “Alexandru Ioan Cuza” of Iasi - Faculty of Computer Science</institution>
          ,
          <addr-line>Street General Henri Mathias Berthelot 16, 700483 Iași</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Processing bird audio data and building classifier algorithms yield insights about bird species and, in turn, improve what is known about avian biodiversity. Research on audio processing and audio classification has been ongoing for some time, with major advances. Paired with the many existing bird species and sound types, this makes the task of bird song classification complex. This work explores popular audio methods to preprocess and augment two types of extracted features, audio signals and spectrogram images, making use of deep learning techniques such as pretraining, fine-tuning, and transfer learning. The results feature moderately performing models and showcase the effect of specific data modifications and classification training techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>Sound classification</kwd>
        <kwd>Convolutions</kwd>
        <kwd>EfficientNetV2</kwd>
        <kwd>Noise</kwd>
        <kwd>Mel Spectrograms</kwd>
        <kwd>1/2D Augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This work represents the working notes detailing our participation in the BirdCLEF 2023
competition, held as part of the LifeCLEF 2023 lab [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The goal of the competition is to
construct classification algorithms that recognize and attribute the bird species present in a
given input audio signal [2]. The main focus of this working note is the set of deep learning experiments
based on both types of data, audio time series and spectrogram images, paired with
different preprocessing and augmentation techniques for both image and signal data. The best results
feature models based on 1D CNN architectures trained and fine-tuned on audio signal data that make
use of background noise as augmentation, achieving private and public scores of 0.63186 and 0.74384,
respectively. The rest of the paper details the runs we submitted to this competition.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Solution Objectives</title>
      <p>The large amount of audio data and the BirdClef2023 competition requirements for model
evaluation (CPU-only runs completing in under 2 hours) imposed quite strict restrictions on what an individual
can submit, making very deep and complex models impractical [2]. Taking into account these restrictions and the
development environment, along with a team size of one person and a late registration to the
competition (in its last month), the goals were not set to be overpromising. As such, the proposed
objectives of this work are to:
• establish a preprocessing and augmentation pipeline for the dataset;
• experiment with simpler model architectures trained from scratch;
• experiment with fine-tuning and transfer learning on pre-trained architectures, such as
EfficientNetV2 [3];
• cover experiments on both signal time series data and spectrogram image data;
• explore Gaussian and uniformly distributed background noise [4][5];
• construct a final model pipeline with the best-found insights.</p>
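      <p>To make the noise objective concrete, the following sketch adds low-amplitude Gaussian or uniform background noise to a waveform. This is an illustrative implementation, not the exact code used in the experiments; the function name and the <monospace>level</monospace> parameter are assumptions.</p>

```python
import numpy as np

def add_background_noise(signal: np.ndarray, kind: str = "gaussian",
                         level: float = 0.005, rng=None) -> np.ndarray:
    """Add low-amplitude background noise to a waveform.

    `kind` selects the noise distribution; `level` is the standard
    deviation (gaussian) or half-range (uniform). Both are illustrative
    parameter names, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    if kind == "gaussian":
        noise = rng.normal(0.0, level, size=signal.shape)
    elif kind == "uniform":
        noise = rng.uniform(-level, level, size=signal.shape)
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return signal + noise.astype(signal.dtype)

# A 5-second frame at 32,000 Hz has 160,000 samples; the shape is preserved.
clean = np.zeros(160_000, dtype=np.float32)
noisy = add_background_noise(clean, kind="uniform", level=0.01)
```

      <p>Applied dynamically at batch-load time, such a function leaves the input shape untouched while varying the noise realization per epoch.</p>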
      <p>Some of the selected experiments were chosen to test popular techniques previously
applied to audio time series and spectrogram data. A good example is the time series and
spectrogram image augmentation survey in [6], which presents Time Warping and Scaling as
being among the best augmentation methods for audio time series.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Exploration</title>
      <p>The dataset provided by the competition after registration contains 264 folders, each representing a
class through an ID and containing a different number of audio files of varying durations. Each
directory contains audio recordings saved in the OGG format. Exploring the dataset revealed some
issues: class imbalance in the audio files (Fig. 1), different noise levels between audio files
(Fig. 3), and empty segments longer than 5 seconds (Fig. 2).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data Preprocessing</title>
    </sec>
    <sec id="sec-5">
      <title>4.1. Audio Data Frame Cutting</title>
      <p>The initial audio files are not in a format that allows for easy processing, so each needs to
be cut into 5-second pieces at a sample rate of 32,000 Hz, resulting in a dataset of samples of
shape (160,000). Among the techniques considered for cutting up the data (Figure 4), the one that
would bring the most quality would be bird sound detection [7]. The compromise between time
and quality was to use a simple window shift without overlap, because it would contain windows
with the bird calls and a smaller share of empty fragments compared to the overlapping window
shift method. By applying this method, the balance of the data suffered even
further, with some classes having as little as one 5-second window while others have over 1,000
inputs worth of data.</p>
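      <p>The non-overlapping window cut described above can be sketched as follows. This is an illustrative implementation; dropping the trailing remainder shorter than one window is an assumption, as the paper does not state how it was handled.</p>

```python
import numpy as np

SAMPLE_RATE = 32_000
WINDOW_SECONDS = 5
WINDOW = SAMPLE_RATE * WINDOW_SECONDS  # 160,000 samples per frame

def cut_into_windows(audio: np.ndarray) -> np.ndarray:
    """Split a recording into non-overlapping 5-second frames,
    dropping the trailing remainder shorter than one window
    (an assumed choice; padding would be an alternative)."""
    n_full = len(audio) // WINDOW
    return audio[: n_full * WINDOW].reshape(n_full, WINDOW)

# Example: a 500,000-sample recording yields 3 full windows.
frames = cut_into_windows(np.zeros(500_000, dtype=np.float32))
```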
    </sec>
    <sec id="sec-6">
      <title>4.2. Dataset Splitting</title>
      <p>The setup for evaluation is done by splitting the training data with a ratio of 70:30 between train and
evaluation, as a swifter alternative to the better-performing K-fold cross-validation technique [8].</p>
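      <p>A minimal sketch of the 70:30 split over shuffled indices follows; whether shuffling or stratification was used is not stated in the paper, so both the shuffle and the seed are assumptions.</p>

```python
import numpy as np

def split_70_30(n_samples: int, seed: int = 0):
    """Shuffle sample indices and split them 70:30 into
    train and evaluation index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    cut = int(0.7 * n_samples)
    return idx[:cut], idx[cut:]

train_idx, eval_idx = split_70_30(1000)
```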
    </sec>
    <sec id="sec-7">
      <title>4.3. Feature Extraction</title>
      <p>From the initial raw audio frames, log-mel features can be extracted by applying the Short-Time
Fourier Transform [9] and the Mel scale transformation [10]; these can be further turned into a
Mel spectrogram image. To obtain a square image of size (128, 128), the following
parameters were used: 128 Mel bands, the competition sample rate of
32,000 samples/second, and a hop length equal to ⌈160,000 / 128⌉.</p>
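      <p>The parameter arithmetic above can be checked directly; the commented call shows how these parameters would map onto a common mel-spectrogram API (the use of librosa is an assumption, as the paper does not name its tooling).</p>

```python
import math

N_SAMPLES = 160_000                # 5 s at 32,000 samples/s
N_MELS = 128                       # spectrogram height (Mel bands)
HOP = math.ceil(N_SAMPLES / 128)   # hop length from the paper: ceil(160000/128)

n_frames = math.ceil(N_SAMPLES / HOP)  # spectrogram width in time frames
# (N_MELS, n_frames) gives the square (128, 128) image the paper targets.

# With librosa (assumed tooling) the same parameters would map to:
#   mel = librosa.feature.melspectrogram(y=audio, sr=32_000,
#                                        n_mels=N_MELS, hop_length=HOP)
#   logmel = librosa.power_to_db(mel)
```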
    </sec>
    <sec id="sec-8">
      <title>4.4. Dataset Augmentation</title>
      <p>Augmentation was applied statically in the preprocessing phase for the spectrogram data, but
only dynamically at batch-load time during training for the signal data. The three techniques used,
which are shown to work well on audio data [6], are Time Warping, Scaling, and Shifting (a subclass of
permutation with one cut). Data scaling is a form of augmentation that increases or decreases the values
of a random time window of the time series [6]. The magnitude of the scaling was uniformly chosen
between -0.1 and 0.1. Time Warping dilates or contracts random windows of time and is
predominantly used as a technique for time series augmentation [6]. As can be seen in Fig. 5, its effect
on the spectrogram data is quite damaging, which is why it was omitted from the spectrogram
preprocessing pipeline; this augmentation technique is still safe to use when working directly
with the time series data.</p>
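      <p>The scaling augmentation can be sketched as follows. This is illustrative only: the paper fixes the magnitude range of [-0.1, 0.1] but does not specify how the random window is drawn, so the window sampling here is an assumption.</p>

```python
import numpy as np

def random_window_scale(x: np.ndarray, max_mag: float = 0.1,
                        rng=None) -> np.ndarray:
    """Scale a random time window of `x` by a factor 1 + s, where s is
    drawn uniformly from [-max_mag, max_mag]. The choice of window
    boundaries is an assumption, not the paper's exact procedure."""
    rng = np.random.default_rng() if rng is None else rng
    out = x.copy()
    start = int(rng.integers(0, len(x) - 1))
    end = int(rng.integers(start + 1, len(x) + 1))
    s = rng.uniform(-max_mag, max_mag)
    out[start:end] *= (1.0 + s)
    return out
```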
      <p>Shifting, a subclass of permutation, is an augmentation technique that cuts the time series in two
and rearranges the order of the resulting pieces [6]. Two types of shifting were applied: shifting the time
series by a factor between -0.1 and 0.1, and cutting it directly in half and reordering the two pieces. The
impact of shifting on the spectrogram image data was minimal in most cases, because the bird song is
not constant throughout the audio. On the signal time series counterpart, however, the discontinuity
of values along the time axis was clearly visible and affected the data quality (see Fig. 6).</p>
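      <p>The two shifting variants can be sketched as follows; interpreting the factor-based shift as a circular roll is an assumption, since the paper does not specify how the cut pieces are rejoined.</p>

```python
import numpy as np

def roll_shift(x: np.ndarray, factor: float) -> np.ndarray:
    """Circularly shift the series by factor * len(x), with the
    factor drawn from [-0.1, 0.1] (circularity is an assumption)."""
    return np.roll(x, int(factor * len(x)))

def half_swap(x: np.ndarray) -> np.ndarray:
    """Cut the series directly in half and swap the two pieces."""
    mid = len(x) // 2
    return np.concatenate([x[mid:], x[:mid]])
```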
      <p>The algorithm used to augment the dataset in the preprocessing phase is described in
Algorithm 1. The goal was to obtain at least 20 distinct input samples per class, favoring species with
fewer than 264 input audio frames by generating more data for them.</p>
      <sec id="sec-8-1">
        <title>Algorithm 1. Augmentation process for each 5-second input audio frame</title>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4.5. Class Weights</title>
      <p>After the augmentation of the spectrograms, the least represented classes had at least 20 samples,
still nothing in comparison to the 8,000 of others; the time series data was unaffected. The problem
with weights being too high or too low is that during training the loss can overshoot and explode,
or certain errors can vanish, making the learning process much harder. The applied solution was
to smooth out the weights for both the spectrogram and the signal data samples; the result is shown in
Fig. 7. The smoothing process is described in Algorithm 2.</p>
      <p>Algorithm 2. Method used to directly scale and smooth the training weights for the model training</p>
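      <p>Since Algorithm 2 is not reproduced here, the following shows one common smoothing scheme with the same intent, compressing the gap between rare and frequent classes; the logarithmic formula and the <monospace>mu</monospace> parameter are assumptions, not the paper's exact method.</p>

```python
import numpy as np

def smooth_class_weights(counts: np.ndarray, mu: float = 0.15) -> np.ndarray:
    """Map per-class sample counts to smoothed loss weights via
    weight_c = max(1, log(mu * N / n_c)). Rare classes get larger
    weights, but the log compresses extremes so the loss neither
    explodes nor vanishes. Formula and `mu` are illustrative."""
    total = counts.sum()
    w = np.log(mu * total / counts)
    return np.maximum(w, 1.0)

# Example: a class with 20 samples vs. one with 8,000.
weights = smooth_class_weights(np.array([20, 8000]))
```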
    </sec>
    <sec id="sec-10">
      <title>4.6. Mixup</title>
      <p>The mixup operation was done directly on the audio signals; the spectrogram feature
extraction was then applied to the combined signal to obtain the mixed-up spectrogram instead
of the original. Usually, mixup is done during training between random samples [11], but given the
time and resources available, it was decided to build a dataset generator using the
mixup method. Additionally, to address the remaining imbalance within the dataset, the mixup dataset
generator was also provided with the initial class weights and target class weights for the resulting
dataset, which further balanced the data amount per bird species.</p>
      <p>The target mixup-generated dataset has a size of 100,000 with a minimum of 264 items per label. The
current label weights are the ones presented in Figure 40, the target label weights are the ones in Figure
7, and the data input is the augmented data, as previously mentioned in the augmentation section for
spectrograms. The total amount of data needed per label was given by the target class weights
multiplied by the dataset size and finally clipped by the minimum requirement of 264 items.</p>
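      <p>The signal-level mixup can be sketched as follows; the Beta-distributed mixing coefficient follows [11], while the <monospace>alpha</monospace> default is an illustrative assumption.</p>

```python
import numpy as np

def mixup_signals(x1: np.ndarray, x2: np.ndarray,
                  y1: np.ndarray, y2: np.ndarray,
                  alpha: float = 0.2, rng=None):
    """Mix two waveforms and their one-hot labels with a coefficient
    lam ~ Beta(alpha, alpha), as in [11]. The spectrogram is then
    extracted from the mixed waveform, not mixed pixel-wise."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```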
    </sec>
    <sec id="sec-11">
      <title>4.7. Partitioning</title>
      <p>The final dataset was separated into multiple partition files, which contained the respective
training and validation data in order. One last step was to shuffle the data so as to
break the ordering between the dataset species.</p>
    </sec>
    <sec id="sec-12">
      <title>4.8. Preprocessing Pipeline</title>
      <p>There were two final configurable preprocessing pipelines for building the datasets used to train
the models submitted to the competition, shown in Fig. 8: one for images and another for signals. In
Fig. 8, all of the experimental preprocessing steps are included: kept steps in green and discarded ones
in red. Validation data is represented in orange and training data in violet. The final products of the
pipelines are the variations of the signal dataset, the variations of the spectrogram dataset, and the
final training class weights. The final spectrogram datasets were generated both with and without
augmentation to cover all cases when experimenting. A notable observation is that the augmentation
happening during the training process is not shown in the schema; it is covered in the Deep Learning
Experiments section.</p>
    </sec>
    <sec id="sec-13">
      <title>5. Deep Learning Experiments</title>
      <p>Common denominators among all the experiments are the Adam optimizer with a starting learning
rate between 0.00001 and 0.000001, the binary cross-entropy loss function, and the ROC AUC metric used to
evaluate the model on training and validation data. The PR AUC metric is also used as a measurement
in some of the experiments. One common feature is the method of loading the data partitions,
done by prefetching and loading at least 2 partitions ahead, which improved the training times for
both the image and time series datasets. Initial testing was done with batch sizes from
{16, 32, 64, 128, 256}; 64 items per batch achieved the shortest training epoch duration in both
scenarios, image and time series data. All of the models were scheduled to train for up to 100 epochs and
used a method to save and keep track of the best validation-performing model at each epoch. If no
improvement on the validation metrics was made for more than 20 epochs, the training was
stopped. The experiment tables are made to highlight the most significant changes in the training
process, and one entry in those tables may in reality be a series of tests concluding in that result.
Three model types were experimented with: 2D convolutional models trained from
scratch on spectrogram data, 1D convolutional models trained from scratch on signal data, and
transfer learning and fine-tuning experiments with EfficientNetV2 B0.</p>
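      <p>The training schedule described above (up to 100 epochs, best-model tracking, patience of 20) can be sketched framework-agnostically; <monospace>evaluate_epoch</monospace> is a hypothetical callback returning the validation ROC AUC for one epoch, not a function from the paper.</p>

```python
def train_with_early_stopping(evaluate_epoch, max_epochs: int = 100,
                              patience: int = 20):
    """Track the best validation score seen so far; stop once
    `patience` epochs pass without improvement. `evaluate_epoch(e)`
    is a hypothetical callback that trains one epoch and returns
    the validation ROC AUC."""
    best, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = evaluate_epoch(epoch)
        if score > best:
            best, best_epoch = score, epoch  # "save" the best model here
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best, best_epoch
```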
      <p>The final submission was made by taking the best three already submitted models (Table 4, indices
6, 7, and 8, by public score) and fine-tuning them in 3 steps of 3 epochs each on the validation
data. Fine-tuning was done only on the last dense layers, with a much lower learning rate. After the 3
fine-tuning iterations, the best model was the one using the base 1D convolutional architecture paired
with some uniform background noise (indices 8-11 in Table 4).</p>
      <p>The model submissions were made using the Kaggle web platform
(https://www.kaggle.com/competitions/birdclef-2023) for uploading the models and running the
notebooks. An important note is that only one author Kaggle account was used in preparing and
submitting competition runs.</p>
    </sec>
    <sec id="sec-14">
      <title>6. Experiment Results</title>
      <p>The augmentation methods that were used, Time Warping, Scaling, and Shifting as described in
[6], did not seem to impact the training much, sometimes even being detrimental, with no strong
evidence in the case of spectrograms due to the faulty data. Among all of the training
experiments, multiple activation functions were tried; the ones with the most consistently good
results were Sigmoid, ReLU, and LeakyReLU. Across all experiments, the fine-tuning
techniques always helped and gave models an easier path to improvement. Another, less
highlighted, technique is the class weights applied to the loss, which helped combat the data
imbalance, especially during the 1D convolutional training sessions. The best submission made (index
9 in Table 4) was a 1D convolutional model built and trained from scratch using a variety of
deep learning techniques combined with added uniform background noise; it was also fine-tuned
on the validation dataset as a last successful step in improving quality.</p>
      <p>Training and experimenting with both the audio waveform data and the spectrogram image data
helps in gathering a variety of helpful insights (Tables 1-3) regarding the portability and consistency
of the used training and pretraining methods. It also provides solid ground for training more
complex ensemble models that make use of both types of data.</p>
    </sec>
    <sec id="sec-15">
      <title>7. Future Work</title>
      <p>There are many possible directions for further exploration, coming from both explored and
unexplored ideas. An interesting idea that some participants used during the competition was to
complement the existing dataset with external bird audio data; this proved to be one of the key points
in obtaining higher-quality classification models. Because of the data issue, the spectrograms’
convolutional models were compromised. A future iteration of the dataset could potentially fix that
problem either by applying transformations to entire data files or by finding a way to exclude empty
audio frames altogether.</p>
    </sec>
    <sec id="sec-16">
      <title>8. Conclusion</title>
      <p>Looking back at the original goals and comparing them to the current state, the dataset-building
pipeline has been finalized and, besides the problem in the mel spectrogram image pipeline, it helped
in building training-ready datasets. The performed experiments added value through the insights they
offered and ultimately allowed building better valid submissions. Most techniques did not manage to
achieve good performance, while others seemed to consistently improve the results, one case being the
added background noise. The classic and simple model architectures turned out to be able to learn to
distinguish among the bird species and even showed potential for more.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Joly, C. Botella, L. Picek, S. Kahl, H. Goëau, B. Deneu, D. Marcos, J. Estopinan, C. Leblanc, T. Larcher, R. Chamidullin, M. Šulc, M. Hrúz, M. Servajean, H. Glotin, R. Planqué, W.-P. Vellinga, H. Klinck, T. Denton, I. Eggel, P. Bonnet, H. Müller, Overview of LifeCLEF 2023: evaluation of AI models for the identification and prediction of birds, plants, snakes and fungi, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2023.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Kahl, T. Denton, H. Klinck, H. Reers, F. Cherutich, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2023: Automated bird species identification in Eastern Africa, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, 2023.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Tan, Q. V. Le, EfficientNetV2: Smaller models and faster training, CoRR, vol. abs/2104.00298, 2021.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. V. Shugaev, N. Tanahashi, P. Dhingra, U. Patel, BirdCLEF 2021: Building a birdcall segmentation model based on weak labels, in: Conference and Labs of the Evaluation Forum, 2021.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] C. Zhang, H. Zhan, Z. Hao, X. Gao, Classification of complicated urban forest acoustic scenes with deep learning models, Forests, vol. 14, p. 206, 2023.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] B. K. Iwana, S. Uchida, An empirical survey of data augmentation for time series classification with neural networks, PLOS ONE, vol. 16, pp. 1-32, 2021.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. S. Kumar, D. Kowerko, TUC Media Computing at BirdCLEF 2021: Noise augmentation strategies in bird sound classification in combination with DenseNets and ResNets, in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st-24th, 2021, vol. 2936 of CEUR Workshop Proceedings, pp. 1617-1626, CEUR-WS.org, 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] K. Korjus, M. N. Hebart, R. Vicente, An efficient data partitioning to improve classification performance while keeping parameters interpretable, PLOS ONE, vol. 11, pp. 1-16, 2016.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] E. Sejdić, I. Djurović, J. Jiang, Time-frequency feature representation using energy concentration: An overview of recent advances, Digital Signal Processing, vol. 19, no. 1, pp. 153-183, 2009.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. Logan et al., Mel frequency cepstral coefficients for music modeling, in: ISMIR, vol. 270, p. 11, Plymouth, MA, 2000.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, CoRR, vol. abs/1710.09412, 2017.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>