<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TUC Media Computing at BirdCLEF 2024: Improving Birdsong Classification Through Single Learning Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arunodhayan Sampath Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Schlosser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danny Kowerko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Junior Professorship of Media Computing, Chemnitz University of Technology</institution>
          ,
          <addr-line>09107 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work presents our contribution to the BirdCLEF 2024 competition, aimed at enhancing birdsong classification through the implementation of single learning models. Our primary objective was to develop a robust model capable of processing continuous audio data to accurately identify bird species from their calls. Key challenges included addressing imbalanced training data, managing domain shifts between training and test samples, and processing extensive soundscape recordings within a limited time frame. Our proposed approach leverages transformer-based architectures, incorporating positional encodings to enhance the spatial context of the input spectrograms. Data augmentation techniques were employed to mitigate the effects of noisy labels and domain shifts. The model training process involved the use of various libraries and frameworks, with a focus on optimizing performance through strategies such as cosine annealing and weighted sampling. Our results indicate the potential to improve the accuracy and efficiency of birdsong classification in support of avian population monitoring and conservation efforts. Out of a total of 974 teams, whose ROC-AUC scores on the private leaderboard ranged between 45.8 % and 69.0 %, our best model achieved a ROC-AUC score of 62.9 %, ranking us at 300th place among all submissions. On the public leaderboard, our model achieved a score of 63.9 %.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio Classification</kwd>
        <kwd>Birdsong Soundscapes</kwd>
        <kwd>Computer Vision and Pattern Recognition</kwd>
        <kwd>Convolutional Neural Networks (CNN)</kwd>
        <kwd>Data-Efficient Image Transformers (DeiT)</kwd>
        <kwd>Vision Transformers (ViT)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and motivation</title>
      <p>
        The BirdCLEF 2024 [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] competition aimed to identify bird species in a global biodiversity hotspot
in the Western Ghats, also known as the Sahyadri. The broader goals of BirdCLEF 2024 included
developing a deep learning model capable of processing continuous data to recognize bird species by
their calls. Specifically, the objectives were to identify endemic bird species in the soundscape data of
the sky-islands, detect and classify endangered bird species (species of conservation concern) despite
having limited training data, and detect and classify nocturnal bird species, which are currently poorly
understood.
      </p>
      <p>This year’s primary challenges involved a significant imbalance in the training set, a shift in the
domain between the clean training samples and the soundscape test samples, and the time constraint of
testing 73.3 hours of diverse soundscape recordings within only 2 hours. Nevertheless, advancements
in deep learning models are expected to aid in monitoring avian populations, facilitate more effective
threat evaluation, and allow for timely adjustments to conservation actions. Ultimately, this will benefit
avian populations and support long-term sustainability efforts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Fundamentals and implementation</title>
      <p>
        The implementation of transformer-based bird species recognition presented in this work comprises
data preparation, feature extraction, model architecture, data augmentation, and training methods. This
work builds on our previous implementations for BirdCLEF 2021 and 2022, which utilized convolutional
neural networks (CNN) [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Data Preparation</title>
        <p>
          The BirdCLEF 2024 training set consisted of 24 459 audio recordings provided by xeno-canto [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], covering
182 bird species. The test dataset consisted of 1 100 recordings, each of 4 minutes in length. Table 1
provides an overview of the individual datasets utilized. Datasets with IDs 2, 3, and 4 were used for
model pre-training. Datasets with IDs 5 and 6 were employed as a data augmentation technique for
background noise. In total, 141 637 recordings across 1 000 classes were collected for training and
data augmentation. The BirdCLEF 2024 train set consisted of audio files sampled at 32 kHz using a
single channel (mono) and compressed using lossy Ogg compression. Similarly, the datasets used for
pre-training and augmentation were converted to 32 kHz and a single channel. The training dataset was
weakly labeled, meaning there was no precise information on the presence or absence of the labeled
birdsong within the recordings. However, the labeled birdsong is typically most likely to occur at the
beginning or the end of each audio file. These audio files were therefore trimmed to their first 5 and last
5 seconds and saved as NumPy arrays to speed up the data loading procedure, since the complete audio
file no longer needs to be loaded. For training and cross-validation, the dataset was split into 5 stratified
folds. For pre-training, we made sure that the species present in BirdCLEF 2024 were not present in our
pre-training dataset in order to avoid leakage.
        </p>
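        <p>The following minimal sketch (our own illustration, not necessarily the exact competition code) shows how such trimming, caching, and stratified splitting can be implemented in Python; the metadata file name and the primary_label column are assumptions.</p>
        <preformat>
# Sketch of the data preparation step: trim each weakly labeled recording to
# its first and last 5 seconds, cache the clips as NumPy arrays, and build
# 5 stratified folds. File and column names are assumptions for illustration.
import numpy as np
import pandas as pd
import librosa
from sklearn.model_selection import StratifiedKFold

SR = 32000          # sampling rate (mono)
CLIP_SECONDS = 5    # keep only the first and last 5 seconds

def cache_clips(ogg_path, out_path):
    audio, _ = librosa.load(ogg_path, sr=SR, mono=True)
    n = SR * CLIP_SECONDS
    head, tail = audio[:n], audio[-n:]
    # store only the trimmed clips so the complete file never has to be re-read
    np.save(out_path, {"head": head, "tail": tail}, allow_pickle=True)

train_df = pd.read_csv("train_metadata.csv")  # assumed metadata file
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(train_df, train_df["primary_label"])):
    train_df.loc[val_idx, "fold"] = fold
        </preformat>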
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Feature extraction</title>
        <p>
          We used the librosa [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] Python library to convert a 1D audio signal into 2D log-mel spectrograms. The
spectrogram representation used the following parameters (a minimal code sketch follows the list below):
• Width (W) = 576 or 256
• Height (H) = 256 or 196
• Sampling rate (SR) = 32 000 Hz
• hop_length = 284
• mel_bins = W // 2
• frequency_min = 50 Hz
• frequency_max = 16 000 Hz
• power = 2.0
• top_db = 100
        </p>
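        <p>A minimal librosa sketch of how these parameters map onto the log-mel spectrogram computation is given below; only the parameter wiring is taken from the list above, the function name is our own.</p>
        <preformat>
# Sketch of the log-mel spectrogram extraction with the parameters listed above
# (e.g. width 576, i.e. mel_bins = 576 // 2 = 288).
import numpy as np
import librosa

SR = 32000

def log_mel_spectrogram(audio, width=576):
    mel = librosa.feature.melspectrogram(
        y=audio,
        sr=SR,
        hop_length=284,
        n_mels=width // 2,
        fmin=50,
        fmax=16000,
        power=2.0,
    )
    # convert the power spectrogram to decibels, limited to a 100 dB range
    return librosa.power_to_db(mel, top_db=100).astype(np.float32)
        </preformat>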
        <p>Algorithm 1 Resizing and rearranging spectrograms for feature extraction.
Require: Input spectrogram S of shape either (576, 256) or (256, 196)
Ensure: Output tensor T of shape either (384, 384) or (224, 224)
1: if S.shape = (576, 256) then
2:   Initialize T ← zeros(384, 384, dtype = S.dtype)
3:   S1 ← S.view(24, 24, 16, 16)
4:   for i = 0 to 23 do
5:     for j = 0 to 23 do
6:       T[i × 16 : (i + 1) × 16, j × 16 : (j + 1) × 16] ← S1[i, j]
7:     end for
8:   end for
9: else if S.shape = (256, 196) then
10:   Initialize T ← zeros(224, 224, dtype = S.dtype)
11:   S1 ← S.view(14, 16, 16, 16)
12:   for i = 0 to 13 do
13:     for j = 0 to 15 do
14:       T[i × 16 : (i + 1) × 16, j × 16 : (j + 1) × 16] ← S1[i, j]
15:     end for
16:   end for
17: else
18:   Raise error: Unsupported input shape
19: end if
20: return T</p>
        <p>Vision transformers (ViT) typically require inputs that are either 384 × 384 or 224 × 224 in
size. Directly resizing spectrograms from 576 × 256 to 384 × 384 or from 256 × 196 to 224 × 224 did not yield
any positive results. Therefore, we utilized a resizing and rearranging method to ensure
that spectrograms are properly provided to the transformer model. The detailed algorithm for this
approach is shown in Algorithm 1. Its spectrogram realization is illustrated in the accompanying Fig. 1.</p>
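        <p>A compact NumPy realization of the first branch of Algorithm 1 might look as follows (our own sketch; the second branch is analogous). Note that 576 × 256 = 384 × 384 = 147 456, so the rearrangement preserves every spectrogram value.</p>
        <preformat>
# Sketch of Algorithm 1 for the (576, 256) case: view the spectrogram as a
# 24 x 24 grid of 16 x 16 patches and lay the patches out on a 384 x 384 image.
import numpy as np

def rearrange_576x256(spec):
    assert spec.shape == (576, 256)
    out = np.zeros((384, 384), dtype=spec.dtype)
    patches = spec.reshape(24, 24, 16, 16)   # 24 x 24 grid of 16 x 16 patches
    for i in range(24):
        for j in range(24):
            out[i * 16 : (i + 1) * 16, j * 16 : (j + 1) * 16] = patches[i, j]
    return out
        </preformat>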
        <sec id="sec-2-2-1">
          <title>Input</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Spectrogram</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>Dropout</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Feature</title>
        </sec>
        <sec id="sec-2-2-5">
          <title>Extraction</title>
          <p>(ViT / DeiT)</p>
        </sec>
        <sec id="sec-2-2-6">
          <title>Feature</title>
        </sec>
        <sec id="sec-2-2-7">
          <title>Refinement</title>
        </sec>
        <sec id="sec-2-2-8">
          <title>Linear Block</title>
        </sec>
        <sec id="sec-2-2-9">
          <title>Output</title>
          <p>1D CNN</p>
        </sec>
        <sec id="sec-2-2-10">
          <title>Batch Norm</title>
        </sec>
        <sec id="sec-2-2-11">
          <title>Swish Activation</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Model Architecture</title>
        <p>
          Figure 2 depicts the model architecture used. It incorporates positional encodings into the input
spectrograms to enhance the spatial context of the data. Upon receiving the input spectrogram,
an additional singleton dimension is added. A positional encoding is then generated by linearly
interpolating values from 0 to 1 along the input height. This encoding is converted to half precision
and reshaped to match the spectrogram dimensions. The positional encoding is concatenated with
the input spectrogram along the channel dimension, resulting in a combined input that includes both
spectral and positional information. This enhanced input is then passed through the backbone network
for feature extraction. The feature extractor used here is ViT / data-efficient image transformers (DeiT)
[
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. After feature extraction, the first element along the middle dimension is selected to reshape the
feature map accordingly. The feature refinement stage processes these reshaped features through a
series of operations, including a dropout layer, a 1D convolutional layer, batch normalization, and a
swish activation function, to further refine the feature representation. The final classification logits are
obtained by passing these refined features through the linear block, a fully connected layer. The output
contains the logits, enabling the model to make improved predictions by leveraging spectral and spatial
information.
        </p>
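        <p>A minimal PyTorch sketch of this architecture is given below. The backbone name, embedding dimension, dropout rate, and other hyperparameters are illustrative assumptions rather than the exact competition settings.</p>
        <preformat>
# Sketch: a positional encoding channel is concatenated to the spectrogram,
# a ViT / DeiT backbone (via timm) extracts features, and a refinement block
# (dropout, 1D CNN, batch normalization, swish / SiLU) precedes the linear head.
import torch
import torch.nn as nn
import timm

class BirdClassifier(nn.Module):
    def __init__(self, backbone="deit_base_patch16_384", n_classes=182, embed_dim=768):
        super().__init__()
        # the backbone receives 2 input channels: spectrogram + positional encoding
        self.backbone = timm.create_model(backbone, pretrained=False,
                                          in_chans=2, num_classes=0)
        self.refine = nn.Sequential(
            nn.Dropout(0.2),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=1),
            nn.BatchNorm1d(embed_dim),
            nn.SiLU(),   # swish activation
        )
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, spec):                       # spec: (B, H, W)
        spec = spec.unsqueeze(1)                   # add a singleton channel dimension
        b, _, h, w = spec.shape
        # positional encoding: linear interpolation from 0 to 1 along the height
        pos = torch.linspace(0, 1, h, device=spec.device, dtype=spec.dtype)
        pos = pos.view(1, 1, h, 1).expand(b, 1, h, w)
        x = torch.cat([spec, pos], dim=1)          # spectral + positional channels
        feat = self.backbone(x)                    # pooled features (B, embed_dim)
        feat = self.refine(feat.unsqueeze(-1)).squeeze(-1)
        return self.head(feat)                     # classification logits
        </preformat>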
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Data Augmentation</title>
        <p>Data augmentation techniques were utilized to address the domain shift between the training and test
sets, as well as to handle weak and noisy labels. A concise overview of the augmentation strategies we
used is provided in [14]; a minimal code sketch follows the list below. These include:
• Noise augmentations
• Short-noise bursts
• General mix up
• tanh-based distortion
• Gaussian noise
• Loudness normalization</p>
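        <p>A minimal sketch of such a pipeline with the audiomentations library is shown below; the class parameters and probabilities are assumptions, and general mix up is typically applied on batches inside the training loop rather than in this waveform-level pipeline.</p>
        <preformat>
# Sketch of a waveform-level augmentation pipeline with audiomentations;
# probabilities and source paths are assumptions for illustration.
from audiomentations import (Compose, AddBackgroundNoise, AddShortNoises,
                             AddGaussianNoise, TanhDistortion,
                             LoudnessNormalization)

augment = Compose([
    AddBackgroundNoise(sounds_path="background_noise/", p=0.5),  # noise augmentations
    AddShortNoises(sounds_path="background_noise/", p=0.3),      # short-noise bursts
    TanhDistortion(p=0.2),                                       # tanh-based distortion
    AddGaussianNoise(p=0.3),                                     # Gaussian noise
    LoudnessNormalization(p=1.0),                                # loudness normalization
])

# usage: augmented = augment(samples=audio, sample_rate=32000)
        </preformat>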
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Training Methods</title>
        <p>
          In the training process, we used various tools and libraries. The main framework we utilized was the
machine learning library PyTorch, along with additional libraries such as the PyTorch Image Models library timm
for transformer backbone models [15], soundfile [16] and librosa [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for audio and signal processing,
audiomentations for data augmentation, and scikit-learn for calculating metrics and creating stratified
folds for training / validation data splits.
        </p>
        <p>The depth and number of heads of ViT and DeiT were reduced to decrease the required computation
times. This reduction meant that standard pre-trained weights, such as those from the ImageNet
dataset, could not be used. As a result, we opted to train the model from scratch using the previous year’s BirdCLEF
competition dataset for 70 training epochs. These weights were in turn used as our pre-trained weights
for BirdCLEF 2024. ViT / DeiT require the image to be either 384 × 384 or 224 × 224 pixels in size, for
which we extracted non-overlapping 16 × 16 patches, arranged in a 24 × 24 grid. Initially, we resized
our spectrograms to meet these dimensions. However, we encountered disappointing results, as
essential features were diluted by noise domination and misallocated attention. Therefore, we
resized and rearranged our spectrograms as described in Algorithm 1. After performing the spectrogram
extraction and model training, the obtained results could still be improved, since transformers lack
an inherent notion of the order of the input spectrograms. Hence, we introduced a positional encoding
block that addresses this by injecting information about the positions of elements in the sequence
directly into the model. The model was trained with a learning rate of 5 · 10<sup>−4</sup> with a cosine annealing
learning rate scheduler to optimize performance over epochs. To prevent overfitting, a weight decay of
1 · 10<sup>−6</sup> was applied. The training process utilized a weighted sampler based on the number of samples
per class to ensure balanced representation during training. For loss computation, the binary cross-entropy
with logits loss function was employed, incorporating secondary labels for more nuanced learning.
The model utilized the AdamW optimizer. The number of epochs for training was set to 70, with a
batch size of 32.</p>
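        <p>The described training setup corresponds roughly to the following PyTorch sketch; the dataset and metadata objects (train_dataset, train_df) and the BirdClassifier model from the architecture sketch are assumptions used for illustration.</p>
        <preformat>
# Sketch of the training configuration: AdamW, cosine annealing, weight decay,
# class-balanced weighted sampling, and binary cross-entropy with logits.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

EPOCHS = 70
BATCH_SIZE = 32

model = BirdClassifier()  # see the architecture sketch above (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = torch.nn.BCEWithLogitsLoss()  # secondary labels enter as soft targets

# weighted sampler based on the number of samples per class
class_counts = train_df["primary_label"].value_counts()
sample_weights = 1.0 / train_df["primary_label"].map(class_counts).values
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))

loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=sampler)

for epoch in range(EPOCHS):
    for spec, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(spec), target)
        loss.backward()
        optimizer.step()
    scheduler.step()
        </preformat>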
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Test results, evaluation, and discussion</title>
      <p>While our submissions under the name Arunodhayan using EfficientNetB0 ranked 76th and 440th on the
public and private leaderboards, respectively, in this report we focus on the systematic
studies obtained with vision transformers. We achieved a macro-averaged ROC-AUC score of 63 % on the
public test set (public leaderboard) and 62 % on the private test set (private leaderboard) by utilizing
vision transformers, excluding classes with no true positive labels. The top-ranked team achieved a
score of 69 % on the private test set. Our best transformer model would have achieved a rank of 300
among 972 participants on the private leaderboard (LB) (team TUC in Table 2).</p>
      <p>The main goal of this competition was to evaluate the test set in under 2 hours of runtime without
using a graphics processing unit (GPU). The primary challenge we faced was the inconsistency in Kaggle’s
hardware, which made it difficult to predict the runtime of our models. As a result, we could not
complete the test runs without altering our transformer models whenever the execution time exceeded 2
hours. To address this issue, we converted our model via OpenVINO [17]. After making the submission,
the execution time was reduced to 117 minutes, and we achieved a score of 61 %. Subsequently, we
experimented with reducing the depth and number of heads of the transformer models, which further
reduced the inference time. The respectively obtained results are presented in Table 2.
</p>
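      <p>For CPU-only inference within the runtime limit, the conversion can be performed along the following lines (a sketch using the current OpenVINO Python API; the model object and example input shape are assumptions).</p>
      <preformat>
# Sketch of converting the trained PyTorch model to OpenVINO for CPU inference.
import torch
import openvino as ov

model = BirdClassifier().eval()                  # trained model (assumption)
example = torch.randn(1, 384, 384)               # one rearranged spectrogram
ov_model = ov.convert_model(model, example_input=example)
ov.save_model(ov_model, "bird_model.xml")        # IR files for the Kaggle kernel

core = ov.Core()
compiled = core.compile_model(ov_model, "CPU")
logits = compiled(example.numpy())[0]            # CPU inference
      </preformat>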
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and outlook</title>
      <p>The BirdCLEF 2024 competition presented a unique challenge compared to previous years. The use
of the metric “a macro-averaged ROC-AUC that excludes classes with no true positive labels” made it
difficult to compare cross-validation results and leaderboard scores. Furthermore, the importance of
developing efficient models that achieve a good balance between accuracy and speed was highlighted.
Our work explored the use of vision transformers and data-efficient image transformers as well as
the optimization of these models to run on CPU hardware by converting them via OpenVINO. Since
transformers require standard shapes such as 384 × 384 or 224 × 224, directly resizing the spectrograms
to the desired shape did not show the expected impact. As an alternative, we resized and
reshaped the spectrograms by organizing 16 × 16 spectrogram patches into a 24 × 24 grid, which yielded
promising results but did not outperform the scores of competing submissions that used ensemble
models including EfficientNets. Due to the reduction in the complexity of the transformer, we could
not use pre-trained weights. Therefore, we trained a model using data from a previous competition,
facilitating faster model convergence. The model’s efficiency depends on the data’s quantity and quality.
In BirdCLEF 2024, we trained a model using a dataset recorded with high-quality microphones. However,
when we tested it using a soundscape dataset recorded with long-distance microphones, there was a
domain shift, meaning the test set was not representative of the training dataset. To address this, we
added non-bird event sounds to the training set as an augmentation, which made it more similar to the
test set and improved model performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF 2024:
          <article-title>Acoustic identification of under-studied bird species in the western ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <source>Overview of lifeclef</source>
          <year>2024</year>
          :
          <article-title>Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the CrossLanguage Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sampathkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kowerko</surname>
          </string-name>
          , Tuc media computing at birdclef 2021:
          <article-title>Noise augmentation strategies in bird sound classification in combination with densenets and resnets, 2021</article-title>
          . URL: https://arxiv.org/abs/2106.10856, arXiv preprint arXiv:
          <volume>2106</volume>
          .
          <fpage>10856</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sampathkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kowerko</surname>
          </string-name>
          , Tuc media computing at birdclef 2022:
          <article-title>Strategies in identifying bird sounds in complex acoustic environments</article-title>
          ,
          <source>in: Proceedings of the Working Notes of CLEF</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2189</fpage>
          -
          <lpage>2198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sohier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <year>Birdclef 2024</year>
          , https://kaggle.com/ competitions/birdclef-2024,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton, Birdclef 2021 - birdcall identification,
          <year>2021</year>
          . URL: https://kaggle.com/competitions/birdclef-2021, kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Navine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton,
          <year>Birdclef 2022</year>
          ,
          <year>2022</year>
          . URL: https: //kaggle.com/competitions/birdclef-2022, kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton,
          <year>Birdclef 2023</year>
          ,
          <year>2023</year>
          . URL: https://kaggle.com/competitions/ birdclef-2023, kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stowell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Stylianou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pamuła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <article-title>Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge, Methods in Ecology and Evolution (</article-title>
          <year>2018</year>
          ). URL: https://arxiv.org/abs/
          <year>1807</year>
          .05812. arXiv:
          <year>1807</year>
          .05812.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Piczak</surname>
          </string-name>
          ,
          <article-title>ESC: Dataset for environmental sound classification</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM International Conference on Multimedia, ACM</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1018</lpage>
          . doi:10.1145/2733373.2806390.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McVicar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Battenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <article-title>librosa: Audio and music signal analysis in python</article-title>
          ,
          <source>in: Proceedings of the 14th python in science conference</source>
          , volume
          <volume>8</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , CoRR abs/
          <year>2010</year>
          .11929 (
          <year>2020</year>
          ). URL: https://arxiv.org/ abs/
          <year>2010</year>
          .11929. arXiv:
          <year>2010</year>
          .11929.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          , CoRR abs/
          <year>2012</year>
          .12877 (
          <year>2020</year>
          ). URL: https: //arxiv.org/abs/
          <year>2012</year>
          .12877. arXiv:
          <year>2012</year>
          .12877.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] A. S. Kumar, T. Schlosser, S. Kahl, D. Kowerko, Improving learning-based birdsong classification by utilizing combined audio augmentation strategies, Ecological Informatics (2024) 102699. URL: https://www.sciencedirect.com/science/article/pii/S1574954124002413. doi:10.1016/j.ecoinf.2024.102699.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] R. Wightman, PyTorch image models, https://github.com/rwightman/pytorch-image-models, 2019. doi:10.5281/zenodo.4414861.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] M. R. Gogins, Soundfile, https://pypi.org/project/soundfile/, 2024. Version 0.12.1.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] OpenVINO Toolkit Contributors, OpenVINO™ toolkit, 2024. URL: https://github.com/openvinotoolkit/openvino, accessed: 2024-06-20.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>