<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of BirdCLEF 2021: Bird call identification in soundscape recordings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Kahl</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom Denton</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Holger Klinck</string-name>
          <email>Holger.Klinck@cornell.edu</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé Glotin</string-name>
          <email>herve.glotin@univ-tln.fr</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé Goëau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Willem-Pier Vellinga</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Planqué</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexis Joly</string-name>
          <email>alexis.joly@inria.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIRAD, UMR AMAP</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Google LLC</institution>
          ,
          <addr-line>San Francisco</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Inria, LIRMM, University of Montpellier</institution>
          ,
          <addr-line>CNRS, Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>K. Lisa Yang Center for Conservation Bioacoustics, Cornell Lab of Ornithology, Cornell University</institution>
          ,
          <addr-line>Ithaca</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Toulon, AMU</institution>
          ,
          <addr-line>CNRS, LIS, Marseille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Xeno-canto Foundation</institution>
          ,
          <addr-line>Groningen</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Conservation of bird species requires detailed knowledge of their spatiotemporal occurrence and distribution patterns. Over the past decade, passive acoustic monitoring (PAM) has become an essential tool to collect data on birds on ecologically relevant scales. However, these PAM efforts generate extensive datasets, and their comprehensive analysis remains challenging. Improved and fully automated acoustic analysis frameworks are needed to advance the field of avian conservation. The 2021 BirdCLEF challenge focused on developing and assessing automated analysis frameworks for avian vocalizations in continuous soundscape data. The primary task of the challenge was to detect and identify all bird calls within the hidden test dataset. This paper describes how the various algorithms were evaluated and synthesizes the results and lessons learned.</p>
      </abstract>
      <kwd-group>
        <kwd>LifeCLEF</kwd>
        <kwd>bird</kwd>
        <kwd>song</kwd>
        <kwd>call</kwd>
        <kwd>species</kwd>
        <kwd>retrieval</kwd>
        <kwd>audio</kwd>
        <kwd>collection</kwd>
        <kwd>identification</kwd>
        <kwd>fine-grained classification</kwd>
        <kwd>evaluation</kwd>
        <kwd>benchmark</kwd>
        <kwd>bioacoustics</kwd>
        <kwd>passive acoustic monitoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Birds are widely used to monitor ecosystem health because they live in most environments
and occupy almost every niche within those environments. Traditionally, human observers
monitor bird populations by conducting point count surveys in an area of interest. At sampling
locations along a transect, the domain expert will visually and aurally count every bird in a
given time window (e.g., 3 or 5 minutes). However, conducting these surveys is time-consuming
and requires expert knowledge in the identification of birds. Because the number of observers
is typically limited, the spatiotemporal resolution of the surveys is limited as well.</p>
      <p>In contrast, passive acoustic monitoring (PAM) uses autonomous recording units (ARUs)
to monitor the acoustic environment, often continuously, in the vicinity of the deployment
location over extended periods (weeks to months). These data sets complement traditional bird
surveys and help to improve our ability to accurately monitor the status and trends of bird
populations and avian diversity more broadly.</p>
      <p>While PAM surveys are very cost-effective to conduct, the handling and analysis of vast
amounts of collected data (often tens or even hundreds of terabytes) remains challenging.
In the past, researchers frequently subsampled the collected data or focused on specific call
types to circumvent this challenge. However, as a consequence, large amounts of data
remain untouched, and new analysis frameworks are required to mine these datasets thoroughly.
Effective analysis frameworks coming out of BirdCLEF and other competitions have the potential
to revolutionize how we monitor and conserve birds and biodiversity in the future.</p>
      <p>
        The LifeCLEF Bird Recognition Challenge (BirdCLEF) focuses on the development of reliable
analysis frameworks to detect and identify avian vocalizations in continuous soundscape data.
Launched in 2014, it has become one of the largest bird sound recognition competitions in terms
of dataset size and species diversity, with tens of thousands of recordings covering up
to 1,500 species [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. BirdCLEF 2021 Competition Overview</title>
      <p>
        Recent advances in the development of machine listening approaches to identify animal
vocalizations have improved our ability to comprehensively analyze long-term acoustic datasets
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. However, it remains difficult to generate analysis outputs with high precision and recall,
especially when targeting a high number of species simultaneously. Bridging the domain gap
between high-quality training samples (focal recordings) and noisy test samples (soundscape
recordings) is one of the most challenging tasks in the area of acoustic event detection and
identification. The 2021 BirdCLEF competition tackled this complex task and was held on Kaggle.
This year’s edition was a so-called ’code competition’ which encouraged participants to publish
their code for the benefit of the community.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Goal and Evaluation Protocol</title>
        <p>The 2021 BirdCLEF challenge focused on developing and assessing automated analysis
frameworks for avian vocalizations in continuous soundscape data. The primary task of the challenge
was to detect and identify all bird calls within the hidden test dataset. Each soundscape was
divided into 5-second segments, and participants were tasked to return a list of audible species
for each segment. The row-wise micro-averaged F1 score was used for evaluation. In previous
editions, ranking metrics were used to assess the overall classification performance. However,
when applying bird call identification frameworks to real-world data, a suitable confidence
threshold must be set to balance precision and recall. The F1 score reflects this circumstance best.
However, the selected threshold can significantly impact the overall performance, especially
when applied to the hidden test dataset.</p>
        <p>Precision and recall were determined based on the total number of true positives (TP), false
positives (FP), and false negatives (FN) for each segment (i.e., row of the submission). More
formally:</p>
        <p>Micro-Precision = TP / (TP + FP), Micro-Recall = TP / (TP + FN)</p>
        <p>The micro-F1 score, the harmonic mean of the micro-precision and micro-recall for each
segment, was defined as:</p>
        <p>Micro-F1 = 2 × (Micro-Precision × Micro-Recall) / (Micro-Precision + Micro-Recall)</p>
        <p>The average across all (segment-wise) micro-F1 scores was used as the final metric. Segments
that did not contain a bird vocalization had to be marked with the ’nocall’ label, which acted as
an additional class label for non-events. The micro-averaged F1 score reduced the impact of rare
events, which only contributed slightly to the overall metric if misidentified. The classification
performance on common classes (i.e., species with high vocal presence) was well reflected in
the metric.</p>
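        <p>For illustration, the following Python sketch (our approximation, not the official Kaggle scorer) computes the competition metric from per-segment label sets, treating ’nocall’ as a regular label:</p>
        <preformat>
def row_f1(predicted: set, truth: set) -> float:
    """Micro-F1 for one 5-second segment (row); every row carries at least one label."""
    tp = len(predicted &amp; truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def competition_score(rows: list) -> float:
    """Average of the per-row micro-F1 scores; rows are (predicted, truth) pairs."""
    return sum(row_f1(p, t) for p, t in rows) / len(rows)

# A correctly predicted 'nocall' row scores 1.0, just like any other exact match.
print(competition_score([({"nocall"}, {"nocall"}), ({"amerob"}, {"amerob", "norcar"})]))
        </preformat>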
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
        <p>The 2021 BirdCLEF challenge featured one of the largest fully annotated collections of
soundscape recordings, covering four different recording locations in North and South America.
With real-world use cases in mind, labels and metrics were chosen to reflect the vast diversity of bird
vocalizations and variable ambient noise levels in omnidirectional recordings.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Training Data</title>
          <p>As in previous editions, training data were provided by the Xeno-canto community and consisted
of more than 60,000 (high-quality, focal) recordings covering 397 species from two continents
(North and South America). The maximum number of recordings for one species was limited
to 500, which only affected a dozen species and still left a highly unbalanced dataset. Nine
species had fewer than 25 recordings, making it difficult to train reliable classifiers without
an appropriate few-shot approach. Participants were allowed to use various metadata to develop
their frameworks. Most notably, we provided detailed location information on recording sites
of focal and soundscape recordings, allowing participants to account for migration and spatial
distribution of bird species. Other metadata, such as secondary labels, call type, and recording quality,
were also provided, allowing participants to apply pre- and post-processing schemes which
were not only based on audio inputs.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Test Data</title>
          <p>
In this edition, test data were hidden and only accessible to participants during the inference
process. This required participants to fine-tune their systems without knowing the value
distribution of the test data. This approach more closely resembles real-world use cases where
vast majorities of the recorded audio data have an unknown species composition. The hidden
test data contained 80 soundscape recordings of 10-minute duration covering four distinct
recording locations. All audio data were collected with passive acoustic recorders (SWIFT
recorders, K. Lisa Yang Center for Conservation Bioacoustics, Cornell Lab of Ornithology; https://www.birds.cornell.edu/ccb/swiftone/)
deployed in Colombia (COL), Costa Rica (COR), the Sierra Nevada (SNE) of California, USA and
the Sapsucker Woods Sanctuary (SSW) in Ithaca, New York, USA. Expert ornithologists provided
annotations for a variety of quiet and extremely dense acoustic scenes (see Figure 1). In addition,
a validation dataset with 200 minutes (20 x 10-minute recordings) of soundscape data was
also provided to allow participants to get a better understanding of the acoustic target domain.
Participants were allowed to use these data for validation or during training. Soundscapes from
the validation data only covered two (COR, SSW) of the four recording locations.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.3. Colombia (COL) &amp; Costa Rica (COR)</title>
          <p>
            The Nespresso Biodiversity Index aims to quantify the impact of eco-friendly coffee farms on
the avian diversity in surrounding areas. In collaboration with the Cornell Lab of Ornithology,
passive acoustic recorders were deployed on coffee farms in Colombia and Costa Rica to measure
the transformative effects of sustainable farming by analyzing large amounts of acoustic data.
Surveys are carried out twice a year to capture how bird species are using the coffee
landscape when both Neotropical resident and migratory birds are present (Nov-Mar) and
around the peak of the breeding season for resident birds (April-Jun) [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Developing automated
detection and identification frameworks can help to provide reliable results over relevant
spatiotemporal scales and help researchers and decision-makers meet their conservation goals.
Expert annotators provided annotations for 40 soundscape recordings of 10-minute duration
collected at various recording sites in Sep-Nov 2019. Additionally, ten fully annotated recordings
from Costa Rica were provided to participants as training or validation data. Soundscapes from
Colombian recording locations were exclusively part of the hidden test data. In contrast to many
other tropical recording sites, these soundscapes did not contain a very high vocal diversity
(due to the proximity to farmland). However, some rare species, for which only very few training
examples were available, were present in the data. Therefore, the data from these two recording
sites could be considered the most challenging of the competition.
          </p>
          <p>[Figure: Recording habitats at the four sites: (a) COL, (b) COR, (c) SNE, (d) SSW.]</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.4. Sierra Nevada (SNE)</title>
          <p>
            Measuring the effects of landscape management activities in the Sierra Nevada, California, USA
can reveal a potential correlation with avian population density and diversity. Passive acoustic
monitoring can help to reduce the costs of observational studies and expand the scale at which
these studies can be conducted, provided there are robust bird call recognition systems [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
For this dataset, passive acoustic surveys were conducted in the Lassen and Plumas National
Forests in May-August 2018. Survey grid cells (4 km²) were randomly selected from a 6,000 km²
area, and recording units were deployed at acoustically advantageous locations (e.g., ridges
rather than gullies) within those cells. The recordings were made from 04:00 to 08:00 for
5–7 days between May 9 and June 10 (sunrise was roughly 05:35–05:50 during that time) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
Because of this, call density was particularly high in this dataset - most soundscapes reflected
the species diversity during the dawn chorus. We randomly selected 20 expertly annotated
10-minute soundscape recordings, which were exclusively part of the hidden test data. Although
sufficient amounts of training data were available for most annotated species, the high number
of overlapping sounds posed a significant challenge.
          </p>
        </sec>
        <sec id="sec-2-2-4">
          <title>2.2.5. Sapsucker Woods (SSW)</title>
          <p>
            As part of the Sapsucker Woods Acoustic Monitoring Project (SWAMP), the K. Lisa Yang Center
for Conservation Bioacoustics at the Cornell Lab of Ornithology deployed 30 SWIFT recorders in
the surrounding bird sanctuary area in Ithaca, NY, USA. This ongoing study aims to investigate
the vocal activity patterns and diversity of local bird species. The data are also used to assess
the impact of noise pollution on the behavior of birds. In 2018, expert birders annotated 20 full
days of audio data recorded between January and June 2017 and provided almost 80,000 labels
across randomly selected recordings. The 2019 edition of BirdCLEF [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] used twelve of these
days as test data and three as validation data. This year, the amount of test data was limited to
twenty 10-minute recordings, including previously unreleased data from this deployment. This
reduction became necessary to balance the test data and to reduce the bias towards a specific
dataset. Additionally, ten randomly selected recordings were provided as validation data to
allow participants to fine-tune their frameworks.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
1,004 participants from 70 countries on 816 teams entered the BirdCLEF 2021 competition and
submitted a total of 9,307 runs. Figure 3 illustrates the performance achieved by the top 50
runs. The private leaderboard score is the primary metric and was computed on
roughly 65% of the test data (based on a random split). It was revealed to participants after
the submission deadline to avoid probing the hidden test data. Public leaderboard scores were
visible to participants over the course of the entire challenge and were determined on 35% of
the entire test data.
          </p>
          <p>
            The baseline F1 score in this year’s edition was 0.4799 (public 0.5467), with all segments
marked as non-events (i.e., nocall), and 686 teams managed to score above this threshold. The
best submission achieved an F1 score of 0.6932 (public 0.7736), and the top 10 best-performing
systems were within a 2% difference in score. Top-scoring participants were required
to publish their code and an associated write-up, and many lower-ranked participants opted to do so as
well, which resulted in a vast collection of publicly available online resources. It also allowed
organizers to inspect frameworks and approaches to assess the current state-of-the-art in
this domain. Unsurprisingly, deep convolutional neural networks were the go-to tool in this
competition, similar to previous editions. In many cases, participants chose to use off-the-shelf
architectures pre-trained on ImageNet (like EfficientNet [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], DenseNet [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], or ResNet [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]).
The vast majority of systems used mel scale spectrograms as input data and applied mixup [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] and
SpecAugment [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] to diversify the training data. Provided metadata like time and location of
training recordings were used to estimate the occurrence probability of individual bird species
to post-filter predictions in many submissions.
          </p>
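          <p>As an illustration of this recipe, a minimal mixup sketch for multi-label spectrogram training could look as follows; the convex label combination shown here is one variant, and some teams instead used the union of the two label vectors:</p>
          <preformat>
import numpy as np

def mixup(spec_a, spec_b, labels_a, labels_b, alpha=0.4):
    """Blend two mel spectrograms and their multi-hot label vectors (mixup, [12])."""
    lam = np.random.beta(alpha, alpha)
    mixed_spec = lam * spec_a + (1.0 - lam) * spec_b
    mixed_labels = lam * labels_a + (1.0 - lam) * labels_b  # some teams used np.maximum here
    return mixed_spec, mixed_labels
          </preformat>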
          <p>
            In addition to code repositories and online write-ups, eight teams also submitted full working
notes, which are summarized below:</p>
          <p>
            Murakami, Tanaka &amp; Nishimori [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], Team Dr.北村の愉快な仲間たち: This team
factorized the problem into three tasks: Nocall detection, bird call classification, and post-processing
based on provided metadata. The detection and classification backbone was a ResNet50, which
used pre-computed mel scale spectrograms as inputs. This team employed a sophisticated
scheme of post-processing, using gradient boosting decision trees to eliminate false detections.
The overall approach is computationally very efficient and required only modest resources for
training and inference. The final submission achieved an F1 score of 0.6932 (public 0.7736).
          </p>
          <p>
            Henkel, Pfeifer &amp; Singer [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], Team new baseline: Off-the-shelf model architectures
pre-trained on ImageNet worked well in this competition. This team used an ensemble of
nine different pre-trained CNN architectures which used 30-second mel scale spectrograms as
input. Most notably, this team used a sample mixup scheme which diversified the training
data within and across samples by using non-event samples from previous editions and the
freefield1010 dataset [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. Context windows were split into 5-second segments for inference.
The final submission achieved an F1 score of 0.6893 (public 0.7998).
          </p>
          <p>
            Conde, et al. [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ], Team Ed and Satoru: This team’s solution also relied on the performance
of pre-trained models. In this case, the backbone consisted of a ResNeSt which used spectrograms
as input. During post-processing, the rolling mean of model confidence scores and clip-wise
confidence scores were used to eliminate false positives. The final submission achieved an F1
score of 0.6738 (public 0.7801) and consisted of 13 different models, including the best-performing
models from the 2020 Kaggle bird call recognition competition.
          </p>
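          <p>A rolling mean over consecutive 5-second segments can be sketched as follows (the window size of three segments is an illustrative choice, not the team’s reported setting):</p>
          <preformat>
import numpy as np

def rolling_mean(probs: np.ndarray, k: int = 3) -> np.ndarray:
    """Smooth per-segment scores of shape (n_segments, n_species) over k segments (k odd)."""
    pad = k // 2
    padded = np.pad(probs, ((pad, pad), (0, 0)), mode="edge")  # repeat edge rows
    kernel = np.ones(k) / k
    return np.apply_along_axis(lambda col: np.convolve(col, kernel, mode="valid"), 0, padded)
          </preformat>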
          <p>
            Puget [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], Team CPMP: Transformers are the go-to model architecture for text processing.
Only recently, vision transformers achieved state-of-the-art results on ImageNet [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] and even
for acoustic event recognition [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. This team tried to adapt vision transformers to the task of
bird call recognition and achieved very strong results without the need for large CNN model
ensembles. Again, mel scale spectrograms were used as input data; however, a patch extraction scheme
which accounted for the sequential nature of acoustic data allowed the use of pre-trained
transformer models despite visually distorted input data. The final input consisted of a 256x576
pixel spectrogram in which each of the 576 time steps contains 16x16 pixels. This way the
entire spectrogram can be reshaped to 24x24 patches of size 16x16 - the input size of pre-trained
vision transformers - while still exploiting the sequential structure of an audio signal. The best
performing submission achieved an F1 score of 0.6736 (public 0.8015) with the best performing
single transformer model achieving an F1 score of 0.6667 (public 0.7569).
          </p>
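          <p>A plausible reconstruction of this reshaping step (a sketch based on the description above, not the team’s actual code) rearranges the 576 patches of a 256x576 spectrogram into the 24x24 patch grid of a 384x384 vision transformer input:</p>
          <preformat>
import numpy as np

spec = np.random.rand(256, 576)  # mel spectrogram: 256 frequency bins x 576 time frames

# Cut the spectrogram into 16x16 patches: 16 patch rows x 36 patch columns = 576 patches.
patches = spec.reshape(16, 16, 36, 16).transpose(0, 2, 1, 3).reshape(576, 16, 16)

# Regroup the 576 patches into the 24x24 grid of a 384x384 ViT input while keeping
# the patch order, and thus the sequential structure of the audio signal, intact.
vit_input = patches.reshape(24, 24, 16, 16).transpose(0, 2, 1, 3).reshape(384, 384)
print(vit_input.shape)  # (384, 384)
          </preformat>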
          <p>
            Schlüter [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ], Team Jan Schlüter: This team used random 30-second crops from each training
recording with binary labels for primary and secondary species to train an 18 model ensemble
of CNNs. Notably, in addition to mixup as a basic augmentation method, other strategies such as
magnitude warping and linear fading of high frequencies were used to emulate variations seen
in soundscape recordings. Predictions were made per file by pooling scores over consecutive
windows. These per-file predictions were then used to post-filter predictions for 5-second
segments. Using additional metadata such as location and time did not help to improve the
results. The best submission achieved an F1 score of 0.6715 (public 0.7595).
          </p>
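          <p>Such file-level pooling and post-filtering can be sketched as follows; the thresholds are hypothetical placeholders rather than the team’s tuned values:</p>
          <preformat>
import numpy as np

def postfilter(seg_probs: np.ndarray, file_thresh: float = 0.3, seg_thresh: float = 0.5):
    """seg_probs: (n_segments, n_species) scores for one soundscape file.
    Returns a boolean matrix of per-segment species decisions."""
    file_probs = seg_probs.max(axis=0)              # pool scores over consecutive windows
    plausible = file_probs >= file_thresh           # species deemed present somewhere in the file
    return (seg_probs >= seg_thresh) &amp; plausible    # keep segment hits only for plausible species
          </preformat>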
          <p>
            Shugaev, et al. [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], Team Just do it: Strong nocall detection performance appeared to have a
significant impact on the overall score in this year’s competition. This team manually labeled
non-event segments in the training data and used these segments to train a binary bird/no
bird detection system. Additionally, the nocall probability was used to weight weakly labeled
training samples during the main model training. This team explored different combinations
of spectrogram window length and threshold tuning to improve scores. The best submission
achieved an F1 score of 0.6605 (public 0.7736).
          </p>
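          <p>One way to realize such a weighting, assuming a standard multi-label binary cross-entropy loss (our sketch, not the team’s exact formulation), is to scale each clip’s loss by its estimated probability of containing a bird:</p>
          <preformat>
import torch
import torch.nn.functional as F

def nocall_weighted_bce(logits, targets, nocall_prob):
    """logits/targets: (batch, n_species); nocall_prob: (batch,) from a bird/no-bird detector."""
    per_clip = F.binary_cross_entropy_with_logits(logits, targets, reduction="none").mean(dim=1)
    weights = 1.0 - nocall_prob  # trust weak labels less on clips that are likely empty
    return (weights * per_clip).mean()
          </preformat>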
          <p>
            Das &amp; Aggarwal [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], Team Error_404: This team designed custom convolutional model
architectures which used raw audio samples as 1D inputs. In addition, an elaborate scheme
of different attention mechanisms was employed. The two-step recognition process consisted
of a binary bird/no bird detector and a species classification model. SpecAugment and mixup
were used to diversify the training data, and the final submission achieved an F1 score of 0.6179
(public 0.6878) through the combination of 1D and 2D convolutional classifiers.
          </p>
          <p>
            Sampathkumar &amp; Kowerko [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], Team Arunodhayan: Domain-specific augmentation
appeared to be key to improving the overall recognition performance. This team focused on data
augmentation, exploring the impact of different methods on the classification accuracy. During
local evaluation, this team was able to improve their baseline F1 score from 0.58 to 0.64 by adding
background samples comprised of a variety of non-events from different data sources. This
team also explored different schemes of weighting ensemble predictions; the best ensemble
consisted of 9 models with ResNet and DenseNet backbones and achieved an F1 score of 0.5890
(public 0.6799).
          </p>
      <sec id="sec-2-3">
        <title>3.1. Per-Species Analysis</title>
        <p>Because the test sets and labels were hidden from competitors, their approaches were mostly
blind to the test set composition. While this was by design (to encourage the creation of strong
general-purpose classifiers, avoiding overfitting to dataset-specific priors), it does inhibit the
kind of iterative problem-solving which is usually available in a research context.</p>
        <p>A correct identification requires correctly classifying whether a segment contains a bird,
assigning a logit to the presence of each particular species, and then determining from the
set of logits which birds are actually present. For the micro-averaged F1 score, the number
of “nocall" segments in the soundscapes dominates the number of segments containing any
particular species, so any solution’s performance on the nocall label has a very large impact on
the model’s success in the competition.</p>
        <p>Taking the top submissions, per-species F1 scores can be computed for each submission. All
species with fewer than five observations in the test set were discarded to reduce variance; this
leaves 92 species. These per-species F1 scores over the top 15 submissions were then aggregated.
A histogram of the max and mean per-species F1 scores is given in Figure 4.</p>
        <p>Had the metric been computed only on segments with birds (removing the no-bird
classification problem), the top submissions would have ranked very differently: The eighth-place
submission (Team KDL, https://www.kaggle.com/hidehisaarai1213/birdclef2021-infer-between-chunk) had
the best average F1 over species. We acknowledge, though, that
changing the metric would have also changed teams’ tuning strategies, limiting the usefulness
of this counterfactual scenario.</p>
        <p>The per-species F1 scores were highly dependent on a submission’s (hidden) choice of
thresholds. As a result, it was hard to compare the performance of particular models on a given species: A
small change of threshold could have a large impact on the F1 score. However, we examined
the species for which the top submissions were uniformly poor.</p>
        <p>The Fox Sparrow (Passerella iliaca, foxspa) and Green-tailed Towhee (Pipilo chlorurus, gnttow)
are an interesting pair. Both had very low mean and max F1 scores across all submissions. Their
complex songs are well-known to be easily confused by human listeners. The Fox Sparrow has
a mean F1 score of 0.01; all but one competitor scored 0. Meanwhile, for gnttow the scores are a
bit better, with mean F1 0.17 and max F1 0.48. In addition to the complex song, the gnttow has a
diagnostic call which is easily identifiable.</p>
        <p>
          This indicates that models could improve significantly if they find some way to better
distinguish easily-confused species. There are several reasons to believe that this is possible. First, some
easily-confused species can be distinguished by human experts. Secondly, the birds themselves
are likely able to distinguish their own species from other species [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Thirdly, birds have
superior temporal integration for consecutive tones of different frequencies [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. This means
that there could be fine structure in these songs which a high-resolution ML algorithm may be
able to distinguish. Finally, in some cases, hard-to-distinguish species have non-overlapping
geographical distribution ranges. Inclusion of species-specific metadata in decision-making can
help in these cases.
        </p>
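        <p>As a sketch of such a metadata-based rule (the 500 km radius is an arbitrary illustrative choice), the candidate list can be restricted to species with training recordings near the test site:</p>
        <preformat>
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def plausible_species(site, species_locations, max_km=500):
    """Keep species with at least one training recording within max_km of the test site."""
    return {species for species, locations in species_locations.items()
            if any(max_km >= haversine_km(site[0], site[1], lat, lon)
                   for lat, lon in locations)}
        </preformat>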
        <p>The second class of common failure was on birds with very few training resources, especially
in the tropical regions, where coverage in Xeno-canto is less comprehensive. The Steely-Vented
Hummingbird (Amazilia saucerottei, stvhum2) is a good example: The max (and mean) F1 score
for stvhum2 was 0.0 over the 15 submissions examined. This species has just 8 recordings in
the training data, most of which have low user ratings and a lot of background noise. The
few high quality recordings also demonstrate a fair amount of variability, though the recorded
vocalizations do have a recognizable ’hummingbird’ timbre.</p>
        <p>This indicates that improved few-shot learning may help with identification tasks, especially
in geo-regions systematically lacking in training data. Improved few-shot
learning models can help with species where there are few training examples, but could also
help in cases where a species has an extremely variable song structure.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.2. Per-Location Analysis</title>
        <p>Recording equipment and annotation scheme were identical for all four recording sites. Because
of this, variation in recognition performance based on location-specific differences could be
explored. Figure 5 shows average scores achieved by the top-15 participants across all recording
locations. Despite the uniform recording and label quality of all four datasets, significant
differences in achieved scores were observed. When considering all ground truth annotations
(incl. nocall), submitted systems performed best on the SSW data. The SSW test data had the
highest number of nocall annotations (i.e., the highest number of 5-second segments without
bird vocalization) and by far the largest amount of focal training data across all audible bird
species. The performance of an automated distinction between bird calls and non-events (i.e.,
nocall-detector) can be considered one of the primary reasons for the drop of 0.182 in F1 score
when only segments with a bird call were considered.</p>
        <p>This drop in scores can be observed for all locations. However, call density does not seem
to affect recognition performance as strongly as other factors. The SNE test dataset almost
entirely consisted of dawn chorus recordings with the highest call density of all four sites (1.19
calls per 5-second segment compared to 0.66 for COL, 0.5 for COR, and 0.52 for SSW). Yet,
performance across all segments that contained a bird was still very strong, with almost no
drop in precision. It appears that other dataset properties had significantly more impact on
the overall recognition performance. Differences in scores were largest for the COR dataset,
which had the highest species diversity of all four sites. Additionally, the ground truth for this
site contained 6 species with fewer than 25 training samples each. These 6 species accounted
for 15% of all annotations, and we can assume that the combination of species diversity and
lack of training data significantly impacted the overall performance. The availability of target
domain (soundscape) training data does not seem to help to overcome the lack of focal training
data. The COL test data had the least amount of focal training samples across all audible bird
species, and the training data did not contain soundscape recordings from this site. However,
performance was significantly higher (+0.144 in F1 score) compared to the COR data for which
soundscape training data were available.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusions and Lessons Learned</title>
      <p>The 2021 BirdCLEF competition held on Kaggle featured a vast collection of training and test
audio data. Participants were asked to develop robust bird call recognition frameworks to
identify avian vocalizations within 5-second segments of soundscape recordings. The
Xeno-canto community provided the primary training data. The test datasets were collected during
passive acoustic surveys in North and South America. Species diversity, variability in call
densities, and lack of training data for rare species posed significant challenges. Deep artificial
neural networks which used spectrograms as input data were used across the board and provided
remarkable results despite the domain gap between training and test data. Post-processing of
detections and the use of additional metadata were key to achieve top results. However, the
overall impact of metadata (e.g., location and time) was only incremental and significantly lower
than expected. It appears that these data may be more useful in scenarios with significantly
higher species diversity. Additionally, the competition setup encouraged the use of large model
ensembles, which might not have real-world applicability.</p>
      <p>Despite the high vocal activity in some test recordings, segments without audible bird
vocalizations dominated the count. Because of this, threshold tuning (especially for ’nocall’
segments) had a significant impact, often masking the real performance of the algorithms.
As a result, many participants relied on separate ’nocall’ detection systems to improve the
overall score. Additionally, in this year’s edition, off-the-shelf CNN backbones pre-trained on
ImageNet provided strong results without the need to investigate the design of domain-specific
architectures further. Hence, only very few participants explored alternative approaches like
transformers or 1D convolutional networks. We will try to address this in upcoming editions.</p>
      <p>Providing introductory code repositories and write-ups greatly improved participation and
encouraged fast workflow development without the need for domain knowledge. We noticed
that this year’s participants quickly adapted to the core challenges of the competition and
greatly appreciated the code notebooks provided by the organizers. In addition, prize money
for the highest-scoring solutions, gamification elements on Kaggle, and the overall outreach of the
platform had a significant impact on participation and helped attract a broader audience.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>Compiling these extensive datasets was a major undertaking, and we are very thankful to the
many domain experts who helped to collect and manually annotate the data for this competition.
Specifically, we would like to thank (institutions and individual contributors in alphabetic order):
Center for Avian Population Studies at the Cornell Lab of Ornithology (José Castaño, Fernando
Cediel, Jean-Yves Duriaux, Viviana Ruiz-Gutiérrez, Álvaro Vega-Hidalgo, Ingrid Molina, and
Alejandro Quesada), Google Bioacoustics Group (Julie Cattiau), K. Lisa Yang Center for
Conservation Bioacoustics at the Cornell Lab of Ornithology (Russ Charif, Rob Koch, Jim Lowe, Ashik
Rahaman, Yu Shiu, and Laurel Symes), Macaulay Library at the Cornell Lab of Ornithology
(Jessie Barry, Sarah Dzielski, Cullen Hanks, Jay McGowan, and Matt Young), Nespresso AAA
Sustainable Quality Program, Peery Lab at the University of Wisconsin, Madison (Phil Chaon,
Michaela Gustafson, M. Zach Peery, and Connor Wood), and the outstanding Xeno-canto
community.</p>
      <p>We would also like to thank Kaggle for helping us host this competition and sponsoring the prize
money. We are especially grateful for the incredible support and efforts of Addison Howard
and Sohier Dane, who helped process the dataset and set up the competition website. Thanks
to everyone who participated in this contest and shared their code base and write-ups with the
Kaggle community.</p>
      <p>All results, code notebooks, and forum posts are publicly available at:
https://www.kaggle.com/c/birdclef-2021</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lorieul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , R. Ruiz De Castañeda, I. Bolon,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dorso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Eggel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <source>Overview of LifeCLEF</source>
          <year>2021</year>
          :
          <article-title>a System-oriented Evaluation of Automated Species Identification and Species Distribution Prediction</article-title>
          ,
          <source>in: Proceedings of the Twelfth International Conference of the CLEF Association (CLEF</source>
          <year>2021</year>
          ),
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Clapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hopping</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF 2020:
          <article-title>Bird sound recognition in complex acoustic environments, in: CLEF task overview 2020, CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2020</year>
          , Thessaloniki, Greece.,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eibl</surname>
          </string-name>
          , H. Klinck,
          <article-title>BirdNET: A deep learning solution for avian diversity monitoring</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>61</volume>
          (
          <year>2021</year>
          )
          <fpage>101236</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Roch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fleishman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-M.</given-names>
            <surname>Nosal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Helble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cholewiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gillespie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Deep neural networks for automated detection of marine mammal species</article-title>
          ,
          <source>Scientific reports 10</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruíz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Román</surname>
          </string-name>
          , J.-Y. Duriaux,
          <source>The Sounds of Sustainability</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Keane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Sawyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Peery</surname>
          </string-name>
          ,
          <article-title>Detecting small changes in populations at landscape scales: A bioacoustic site-occupancy framework</article-title>
          ,
          <source>Ecological Indicators</source>
          <volume>98</volume>
          (
          <year>2019</year>
          )
          <fpage>492</fpage>
          -
          <lpage>507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chaon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Peery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Survey coverage, recording duration and community composition affect observed species richness in passive acoustic surveys</article-title>
          ,
          <source>Methods in Ecology and Evolution</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>885</fpage>
          -
          <lpage>896</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-R.</given-names>
            <surname>Stöter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of birdclef 2019:
          <article-title>Large-scale bird recognition in soundscapes</article-title>
          ,
          <source>in: CLEF working notes</source>
          <year>2019</year>
          ,
          <article-title>CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2019</year>
          , Lugano, Switzerland.,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , EfficientNet:
          <article-title>Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>Densely connected convolutional networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>4700</fpage>
          -
          <lpage>4708</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cisse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lopez-Paz</surname>
          </string-name>
          ,
          <article-title>mixup: Beyond empirical risk minimization</article-title>
          ,
          <source>arXiv preprint arXiv:1710.09412</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Specaugment: A simple data augmentation method for automatic speech recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1904.08779</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Murakami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nishimori</surname>
          </string-name>
          ,
          <article-title>Birdcall Identification using CNN and Gradient Boosting Decision Trees with Weak and Noisy Supervision</article-title>
          ,
          <source>in: CLEF Working Notes</source>
          <year>2021</year>
          ,
          <article-title>CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Henkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pfeifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <article-title>Recognizing bird species in diverse soundscapes under weak supervision</article-title>
          ,
          <source>in: CLEF Working Notes</source>
          <year>2021</year>
          ,
          <article-title>CLEF: Conference and Labs of the Evaluation Forum</article-title>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stowell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Plumbley</surname>
          </string-name>
          ,
          <article-title>An open dataset for research on audio field recording archives: freefield1010</article-title>
          ,
          <source>arXiv preprint arXiv:1309.5275</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Conde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Movva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Agnihotri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bessenyei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shubham</surname>
          </string-name>
          ,
          <article-title>Weakly-Supervised Classification and Detection of Bird Sounds in the Wild. A BirdCLEF 2021 Solution</article-title>
          ,
          <source>in: CLEF Working Notes 2021, CLEF: Conference and Labs of the Evaluation Forum</source>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Puget</surname>
          </string-name>
          ,
          <article-title>STFT Transformers for Bird Song Recognition</article-title>
          ,
          <source>in: CLEF Working Notes 2021, CLEF: Conference and Labs of the Evaluation Forum</source>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-A.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>AST: Audio Spectrogram Transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2104.01778</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlüter</surname>
          </string-name>
          ,
          <article-title>Learning to monitor birdcalls from weakly-labeled focused recordings</article-title>
          ,
          <source>in: CLEF Working Notes 2021, CLEF: Conference and Labs of the Evaluation Forum</source>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shugaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tanahashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>BirdCLEF 2021: Building a birdcall segmentation model based on weak labels</article-title>
          ,
          <source>in: CLEF Working Notes 2021, CLEF: Conference and Labs of the Evaluation Forum</source>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <article-title>Bird-Species Audio Identification, Ensembling 1D + 2D Signals</article-title>
          ,
          <source>in: CLEF Working Notes 2021, CLEF: Conference and Labs of the Evaluation Forum</source>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sampathkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kowerko</surname>
          </string-name>
          ,
          <article-title>TUC Media Computing at BirdCLEF 2021: Noise augmentation strategies in bird sound classification in combination with DenseNets and ResNets</article-title>
          ,
          <source>in: CLEF Working Notes 2021, CLEF: Conference and Labs of the Evaluation Forum</source>
          , Sep.
          <year>2021</year>
          , Bucharest, Romania.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Catchpole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Slater</surname>
          </string-name>
          ,
          <source>Bird song: biological themes and variations</source>
          , Cambridge University Press,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Dooling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lohr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Dent</surname>
          </string-name>
          ,
          <article-title>Hearing in birds and reptiles</article-title>
          ,
          <source>in: Comparative hearing: birds and reptiles</source>
          , Springer,
          <year>2000</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>