<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning with Pseudo Multi-Label Birdcall Classification for DS@GT BirdCLEF 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Cheung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Murilo Gustineli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashley Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present working notes for the DS@GT team on transfer learning with pseudo multi-label birdcall classification for the BirdCLEF 2024 competition, focused on identifying Indian bird species in recorded soundscapes. Our approach utilizes production-grade models such as the Google Bird Vocalization Classifier, BirdNET, and EnCodec to address representation and labeling challenges in the competition. We explore the distributional shift between the training data and this year's unlabeled soundscapes, which are representative of the hidden test set, and propose a pseudo multi-label classification strategy to leverage the unlabeled data. Our highest post-competition public leaderboard score is 0.63 using BirdNET embeddings with Bird Vocalization pseudo-labels. Our code is available at github.com/dsgt-kaggle-clef/birdclef-2024.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer Learning</kwd>
        <kwd>Dataset Annotation</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Association Rule Mining</kwd>
        <kwd>Google Bird Vocalization Classifier</kwd>
        <kwd>BirdNET</kwd>
        <kwd>EnCodec</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Birdcall Classification Overview</title>
      <p>
        Birdcall classification is a challenging task due to the variability in bird vocalizations, the presence of
background noise, and the large number of species to classify. In addition, the measured data comes in
the form of audio recordings, which are high-dimensional and require specialized processing techniques.
Many successful approaches to birdcall classification utilize convolutional neural networks (CNNs) to
extract features from audio spectrograms. Audio spectrograms are a time-frequency representation
of the audio signal, which are extracted using the short-time Fourier transform (STFT) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], often with
additional preprocessing steps such as mel-frequency scaling. The spectrograms are represented as 2D
images, allowing models to draw on the rich literature surrounding image classification.
      </p>
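      <p>As a concrete illustration of the pipeline above, the following is a minimal NumPy sketch of a log-mel spectrogram computed via the STFT. The parameter values (FFT size, hop length, number of mel bands) are illustrative choices, not the settings used by any of the competition models.</p>
      <preformat>
```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude of the short-time Fourier transform over Hann windows."""
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(window * signal[start:start + n_fft]))
        for start in range(0, len(signal) - n_fft + 1, hop)
    ]
    return np.array(frames).T  # shape: (n_fft // 2 + 1, n_frames)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

# A 5-second clip at 32 kHz (the competition's interval length) of noise.
sr = 32000
clip = np.random.default_rng(0).normal(size=5 * sr)
spec = stft_magnitude(clip)
mel_spec = np.log1p(mel_filterbank(64, 512, sr) @ spec)
```
      </preformat>
      <p>The resulting 2D array is what CNN-based classifiers consume as an image.</p>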
      <p>
        BirdNET is a popular birdcall classification model that utilizes the spectrogram-CNN approach. It
is widely distributed in the field due to its high accuracy and ease of use on mobile devices [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
Google Bird Vocalization Classifier is another model using EfficientNet-B1, a similar CNN architecture,
and is trained on many soundscapes. It was released alongside the BirdCLEF 2023 competition and has
more than 10,000 species in its output space [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Domain Knowledge Transfer via Embedding Spaces</title>
      <p>
        Neural networks are universal function approximators that map some input space to an arbitrary output
space [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A neural network’s intermediate layers can be considered a manifold that typically projects
high dimensional data to lower dimensional spaces. Transfer learning is a technique that leverages the
learned representations of a model trained on one task to improve the performance of a model on a
different task. In the context of birdcall classification, raw audio data is projected onto a manifold
optimized to discriminate between bird species. The projections are called embeddings and can be used
as features in downstream tasks to transfer knowledge to a new but similar domain. Few-shot learning
on global bird embeddings tends to be effective in new domains, as shown by Ghani et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        We visually inspect learned embeddings by projecting them into two or three dimensions to reveal
meaningful semantic structures that can be qualitatively interpreted by humans. We use PaCMAP, a
dimensionality reduction technique that preserves local and global structure on a manifold via pairwise
relationships [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], to visualize embeddings from the Bird Vocalization Classifier, BirdNET, and EnCodec.
In Figure 1, we project the top five species by frequency in the soundscape dataset. The first two are
domain-specific models trained on bird vocalizations, while the third is a general-purpose neural audio
codec. EnCodec is particularly interesting because it is not trained on bird vocalizations but rather a
diverse set of audio data using a self-supervised learning objective guided by self-attention mechanisms
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It is also perceptually optimized for audio compression, preserving meaningful structures in the
embedding space, such as bird vocalizations over static background noise.
      </p>
      <p>We explore the effectiveness of transfer learning, where Bird Vocalization predictions serve as surrogates for
true labels. We hypothesize that transfer learning is an effective technique for the competition because
existing models capture underlying structures amenable to optimization by simple linear classifiers. We
quantify how well we can learn domain-specific adaptations between different embedding spaces and
how well each of these models is suited to capture the underlying structure of the data.</p>
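      <p>The transfer learning setup described above reduces to a linear probe: a small classifier trained on frozen embeddings. The following is a minimal NumPy sketch of the idea on synthetic data standing in for real embeddings; the dimensions, learning rate, and cluster structure are hypothetical illustrations, not our pipeline's settings.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for frozen embeddings: two species clusters in R^16.
# In our pipeline these would be 1280-d Bird Vocalization embeddings.
n, dim, n_species = 200, 16, 2
centers = rng.normal(size=(n_species, dim))
labels = rng.integers(0, n_species, size=n)
X = centers[labels] + 0.1 * rng.normal(size=(n, dim))
Y = np.eye(n_species)[labels]  # one-hot multi-label targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A linear classifier head trained with binary cross-entropy via gradient descent.
W = np.zeros((dim, n_species))
b = np.zeros(n_species)
for _ in range(500):
    P = sigmoid(X @ W + b)
    grad = X.T @ (P - Y) / n  # BCE gradient with respect to the logits
    W -= 0.5 * grad
    b -= 0.5 * (P - Y).mean(axis=0)

accuracy = (sigmoid(X @ W + b).argmax(axis=1) == labels).mean()
```
      </preformat>
      <p>Because the embedding space already separates the classes, the linear head recovers the labels with only a few hundred gradient steps.</p>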
    </sec>
    <sec id="sec-4">
      <title>4. Exploratory Data Analysis</title>
      <p>
        We perform an exploratory analysis on the training and unlabeled soundscape datasets to understand
species distribution and their co-occurrence patterns. We hypothesize a shift in species distribution
between the training and unlabeled soundscape datasets due to differences in the recording
environments, a shift we observe downstream in domain knowledge transfer. The training dataset contains 182
species obtained from crowd-sourced Xeno-Canto recordings. Because they are crowd-sourced, the
training data is likely biased toward clear and distinct vocalizations that typically occur in isolation. The
soundscape dataset is a collection of 4-minute soundscapes from Western Ghats, India, representative
of the hidden test set in the competition [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The soundscapes are likely to be more intermittent and to
contain less distinct, overlapping vocalizations due to the lack of human-directed attention
in the recording process.
      </p>
      <p>We use the Bird Vocalization model to extract the embeddings and logits from the datasets in
5-second intervals. The training dataset comprises 217k discrete intervals totaling 302 hours, while the
soundscape dataset has 407k intervals totaling 566 hours. We assume that an interval contains a call
if the maximum logit value run through a sigmoid function exceeds a threshold of 0.5. The training
dataset has a higher density of calls, with 62% of intervals containing at least one call, compared to
8.8% in the soundscape dataset.</p>
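      <p>The call-detection heuristic above can be stated in a few lines. The logit values below are hypothetical; the threshold of 0.5 follows the text.</p>
      <preformat>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contains_call(logits, threshold=0.5):
    """An interval contains a call if the sigmoid of its maximum logit
    exceeds the threshold, following the heuristic described above."""
    return sigmoid(np.max(logits)) > threshold

# Hypothetical logit vectors for two 5-second intervals over 4 species.
quiet_interval = np.array([-6.0, -4.5, -8.0, -5.2])  # all strongly negative
vocal_interval = np.array([-6.0, 2.3, -8.0, -5.2])   # one confident species
```
      </preformat>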
      <p>We compute the relative frequency of species over all discrete intervals and compare their distributions
by the ranked frequency in the soundscape dataset in Figure 2. There is a notable discordance between
the two distributions, with many species in the training set unrepresented in the unlabeled soundscapes.
In Table 1, we observe that the top species in each dataset do not align when ranked by raw frequency of occurrence.</p>
      <p>
        We use a frequent-pattern mining algorithm, FPGrowth [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], to identify co-occurrence patterns in
the soundscape dataset. In Table 2, we observe that co-occurrences of species can appear more often
than individual species alone. The frequent itemsets give us a rough estimate of how many birds we
can expect to see in a single recording. In Figure 3, we plot the distribution of normalized itemset sizes
in the training and soundscape datasets. We observe an approximately normal distribution of sizes
centered around four to six species per recording. The training set is skewed toward smaller itemsets,
likely due to biases in the data collection process where individuals are more likely to record and upload
isolated calls.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <p>The experiments are run over several modular stages. We implement an end-to-end workflow that
applies domain-specific fine-tuning to state-of-the-art models for birdcall classification. We quantify
differences between choices of dataset, architecture, and training losses. The second part of the
experiment focuses on transfer learning using the model as a surrogate, where the model’s predictions
are used as labels for transfer learning on audio classification models. In particular, we study a widely
distributed birdcall-specific convolutional neural network and a self-supervised neural audio codec for
encoding and decoding.</p>
      <sec id="sec-5-1">
        <title>5.1. Transfer Learning</title>
        <p>The Google Bird Vocalization Classification model is the main surrogate transfer learning experiment,
focused on version 4 of google/bird-vocalization-classifier on Kaggle, corresponding to
version 1.3 on the TensorFlow hub. We directly compute a prediction for the competition by selecting
the competition subset, filling in the missing species with negative infinity, and computing the sigmoid
of the logits. We perform fine-tuning of the model by training a new classification head on the training
dataset using the thresholded predictions of the model as pseudo-labels for the multi-label classification
task. We take advantage of the species label of the folder according to Section 5.3 as one form of
augmentation. We also fine-tune the model on the unlabeled soundscape data.</p>
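        <p>The direct-prediction step described above, selecting the competition subset and filling missing species with negative infinity before applying the sigmoid, can be sketched as follows. The species names and logit values are hypothetical placeholders.</p>
        <preformat>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model output space and competition species list.
model_species = ["spec1", "spec2", "spec3"]
competition_species = ["spec2", "spec4"]  # "spec4" is missing from the model

logits = np.array([0.2, 1.5, -0.3])  # logits for one 5-second interval

# Select the competition subset, filling missing species with -inf so that
# their sigmoid probability is exactly zero.
index = {s: i for i, s in enumerate(model_species)}
subset = np.array([
    logits[index[s]] if s in index else -np.inf
    for s in competition_species
])
probs = sigmoid(subset)
```
        </preformat>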
        <p>We experiment with different losses to optimize the multi-label classifier, including binary
cross-entropy (BCE), asymmetric loss (ASL), and sigmoidF1, which are explained in Section 5.4. We
also experiment with a hidden layer to increase the capacity of the models. The model is trained for 20 epochs
with a batch size of 1,000 and a learning rate calculated by PyTorch Lightning. The training dataset
is split with an 80-20 train-validation split. The model is trained on a single NVIDIA L4 on a Google
Cloud Platform (GCP) g2-standard-8 instance with 8 vCPU, 16GB of memory, and 375GB of local NVME
storage for dataset caching and model checkpoints.</p>
        <p>We use BirdNET V2.4 through joeweiss/birdnetlib and EnCodec through
facebookresearch/encodec v0.1.1 as comparisons for knowledge transfer. Though both
the Bird Vocalization model and BirdNET provide predictions for classification, the former provides a
more extensive set of species that overlaps with this year’s competition. We ignore the outputs of the
BirdNET model for our experiments and focus on learning the distribution of the Bird Vocalization
model’s outputs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data Preprocessing</title>
        <p>We pre-compute the embeddings and predictions of the Bird Vocalization model on the training and
unlabeled soundscape datasets into a binary, columnar format that is easily accessible from network
storage. The embeddings lie in an ℛ^1280 space, while predictions are limited to the competition’s species
set. If the species is not present, its prediction is set to zero by assigning negative infinity to the logit
output. To save computation, we also pre-compute and join the embeddings from BirdNET and EnCodec
with the predictions of the Bird Vocalization model.</p>
        <p>For BirdNET, we must align the model’s input size of 3 seconds at 48 kHz to the 5 seconds at 32 kHz
that both the Bird Vocalization model and the BirdCLEF competition expect. We take the mean of the
embeddings of two 3-second windows starting at the 0th and 2nd seconds of each 5-second clip. This provides
coverage of the entire audio clip while limiting the computational burden of encoding. We take 5-second
embedding tokens at 24 kHz and limit the bandwidth of EnCodec to 1.5 kbps for an embedding space of
ℛ^(5×150). Increasing the bandwidth to 3 kbps leads to an embedding space of ℛ^(5×300). We qualitatively
inspect the embeddings through a cluster analysis in Figure 1, noting the relative difficulty of separating
common classes within the dataset.</p>
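        <p>The BirdNET window-alignment scheme can be sketched as follows, with a toy statistic standing in for the real model’s embedding function; an actual BirdNET embedding is a much higher-dimensional vector, and the clip here is already assumed resampled to 48 kHz.</p>
        <preformat>
```python
import numpy as np

def birdnet_aligned_embedding(clip, sr=48000, window_sec=3, starts=(0, 2)):
    """Average 3-second window embeddings over a 5-second clip.
    Windows starting at seconds 0 and 2 cover [0, 3) and [2, 5), spanning
    the whole clip with minimal overlap."""
    def embed(window):
        # Hypothetical embedding function: a real model returns a long
        # feature vector; here we summarize the window with two statistics.
        return np.array([window.mean(), window.std()])
    windows = [
        clip[int(s * sr):int((s + window_sec) * sr)] for s in starts
    ]
    return np.mean([embed(w) for w in windows], axis=0)

sr = 48000
clip = np.random.default_rng(1).normal(size=5 * sr)  # synthetic 5 s clip
emb = birdnet_aligned_embedding(clip, sr=sr)
```
        </preformat>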
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Pseudo Multi-Label Construction</title>
        <p>The training dataset lacks traditional labels for supervised learning, as the 5-second intervals in each
recording are not labeled with the species present. We use pseudo-labels derived from the thresholded
predictions of a surrogate model, which are not human-verified ground truth. Additionally, we use the
folder species as an extra label for further training the model. The thresholded predictions are defined
as a function of the model output g(x) and a threshold t, with the sigmoid function denoted by σ.</p>
        <p>ŷ = σ(g(x)) &gt; t (1)</p>
        <p>We define an indicator variable 1_call that determines whether the model output detects a birdcall,
which occurs when any species prediction is positive.</p>
        <p>1_call(ŷ) = ( ∑_{i=0}^{|ŷ|} ŷ_i ) &gt; 0 (2)</p>
        <p>We also generate a one-hot encoding 1_species of the folder to which the current audio belongs,
where S is the set of species in the folder.</p>
        <p>1_species(s) = 1 when s ∈ S, 0 when s ∉ S (3)</p>
        <p>
          We experiment with different losses to optimize the multi-label classifier. The competition evaluation
uses a modified ROC-AUC that skips classes with no true-positive labels. We utilize MultilabelAUROC
from the torchmetrics library as the primary learning metric. We also consider the macro-F1 score
as a secondary metric, which was utilized in the 2022 edition of the BirdCLEF competition [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This
metric allows us to inspect other aspects of the loss functions we consider in our experiments.
        </p>
        <p>Finally, we define our modified label by augmenting the model’s thresholded output with the species
of the folder whenever a call is detected. This can be implemented as a vectorized operation in PyTorch.</p>
        <p>ŷ_species = ŷ ∨ (1_call(ŷ) ∧ 1_species(s)) (4)</p>
        <p>We use a threshold of t = 0.5 when defining all labels. Experiments on the unlabeled
soundscapes do not have the additional information provided in the training dataset, and thus we are
limited to the pseudo-labels ŷ.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Training Losses</title>
        <sec id="sec-5-4-1">
          <title>5.4.1. Binary Cross-Entropy</title>
          <p>Binary cross-entropy (BCE) is a loss function for binary classification. It is suitable for multi-label
classification because it treats each label as an independent binary classification problem. We use this
loss as a baseline due to its simple interpretation and absence of hyperparameters.</p>
          <p>ℒ_BCE = − ∑_{i=1}^{C} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ] (5)</p>
        </sec>
        <sec id="sec-5-4-2">
          <title>5.4.2. Asymmetric Loss (ASL)</title>
          <p>
            The asymmetric loss [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] penalizes false positives and false negatives differently. This construction
dynamically down-weights easy negative samples, hard-thresholds them, and ignores misclassified
samples. This loss is well-suited for our problem domain since we have fuzzy labels from another model
initially intended for single-label classification.
          </p>
          <p>ℒ_ASL = { ℒ+ = (1 − p)^γ+ log(p) ; ℒ− = p^γ− log(1 − p) } (6)</p>
          <p>The loss is defined in terms of the probability of the network output p and hyper-parameters γ+
and γ−, where ℒ+ applies to positive labels and ℒ− to negative labels. Setting γ− &gt; γ+ emphasizes
positive examples, while setting both terms to 0 yields binary cross-entropy. We sweep over parameters
γ+ ∈ {0, 1} and γ− ∈ {0, 2, 4}, while the default values are γ+ = 1 and γ− = 4.</p>
        </sec>
        <sec id="sec-5-4-3">
          <title>5.4.3. sigmoidF1</title>
          <p>
            The sigmoidF1 loss [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] optimizes the F1 score directly by creating a differentiable approximation of
the F1 score. Though the competition does not score with F1, it provides a useful point of comparison
with other losses. We first define a parameterized sigmoid S with hyper-parameters β and η.
          </p>
          <p>S(u; β, η) = 1 / (1 + exp(−β(u + η))) (7)</p>
          <p>We then define soft true positive, false positive, false negative, and true negative counts as a
function of this sigmoid applied to the model output ŷ, where ⊙ denotes element-wise multiplication.</p>
          <p>t̃p = ∑ S(ŷ) ⊙ y,  f̃p = ∑ S(ŷ) ⊙ (1 − y),  f̃n = ∑ (1 − S(ŷ)) ⊙ y,  t̃n = ∑ (1 − S(ŷ)) ⊙ (1 − y) (8)</p>
          <p>Finally, we define a soft F1 score from the true positive, false positive, and false negative counts and
minimize its complement.</p>
          <p>ℒ_F̃1 = 1 − F̃1, where F̃1 = 2t̃p / (2t̃p + f̃n + f̃p) (9)</p>
          <p>We sweep over the hyper-parameters β ∈ {−1, −15, −30} and η ∈ {0, 1}, as suggested by the
authors’ experiments.</p>
        </sec>
      </sec>
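      <p>The sigmoidF1 construction can be sketched as follows in NumPy; the training implementation operates on PyTorch tensors, and the slope parameter here uses a positive, increasing-sigmoid convention for readability rather than the swept values reported above. The logits and targets are hypothetical.</p>
      <preformat>
```python
import numpy as np

def sigmoid_f1_loss(logits, targets, beta=5.0, eta=0.0):
    """Differentiable F1 surrogate: replace hard prediction counts with
    a parameterized sigmoid S(u; beta, eta) = 1 / (1 + exp(-beta * (u + eta)))."""
    s = 1.0 / (1.0 + np.exp(-beta * (logits + eta)))
    tp = np.sum(s * targets)          # soft true positives
    fp = np.sum(s * (1.0 - targets))  # soft false positives
    fn = np.sum((1.0 - s) * targets)  # soft false negatives
    soft_f1 = 2.0 * tp / (2.0 * tp + fn + fp)
    return 1.0 - soft_f1

# Hypothetical logits and multi-label targets: two intervals, three species.
logits = np.array([[4.0, -4.0, -4.0], [-4.0, 4.0, 4.0]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])

good_loss = sigmoid_f1_loss(logits, targets)   # predictions match targets
bad_loss = sigmoid_f1_loss(-logits, targets)   # predictions inverted
```
      </preformat>
      <p>Well-aligned logits drive the loss toward 0, while inverted logits drive it toward 1, matching the behavior of the hard F1 score.</p>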
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>We obtain results for various models on the leaderboard via code submission on Kaggle. We report
the best validation F1 and AUROC scores, together with private and public leaderboard scores. All
submissions were made past the competition deadline with the exception of the starter Keras notebook.
We submit a model that predicts 0 for every species on the leaderboard, leading to a private and public
score of 0.5. We submit the predictions from the Bird Vocalization model and obtain private and public
scores of 0.516625 and 0.556097, respectively.</p>
      <sec id="sec-6-1">
        <title>6.1. Loss Comparisons</title>
        <p>In Table 3, we train a linear classifier head against combinations of BCE and ASL with the addition
of the species label logic. We report our validation F1 and AUROC scores alongside the private and
public scores. We note that AUROC quickly saturates against the validation set used in the training
dataset. The validation F1 score, however, correlates more strongly with the leaderboard scores. Using
the species labels typically increases the score by 0.05; e.g., ASL with default parameters goes from 0.529
to 0.576 on the public leaderboard.</p>
        <p>We experimented with adding a hidden layer behind the classification head to encourage the model
to learn more complex patterns. Using ASL as the loss function, we varied the hyperparameters listed
in Table 4. We confirmed the efficacy of the species logic but noted that the scores were marginally
lower than those of the linear models. Additionally, we found that the default parameters of ASL are
effective in most tasks, with minimal tuning needed for good performance on domain-specific tasks.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Embedding Model Comparisons</title>
        <p>We summarize the performance of each loss function across the CNN-based models in Table 5. Due
to CPU-time limitations on notebook runtime, we do not include an EnCodec-based model. Our best
model on the public leaderboard uses BirdNET embeddings and the BCE loss. BirdNET embeddings
consistently perform better with linear models, despite the origin of the labels being the Bird Vocalization
model. Access to the species label from the parent folder consistently improves scores. While BCE
performs well, this behavior is not indicated by our validation and private test metrics alone.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Dataset Comparisons</title>
        <p>In Table 6, we compare the performance of linear models trained on the soundscape dataset using
ASL as the main loss. We observe two main results: (1) BirdNET embeddings outperform the bird
vocalization model by 0.03 on the public leaderboard, and (2) models trained on the soundscape dataset
are less effective than those trained on the distribution of the training dataset. This may be attributed
to ASL’s dynamic downscaling of easily classified negative labels, making the contribution of training
labels more significant than the similarity to the test distribution.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Inference Runtime</title>
        <p>We profile each model to estimate the time required to process all test soundscapes, as shown in Table 7.
The Python profiler measures the time spent in each function and the number of function calls. Reading
all audio into chunked arrays from disk into memory, our baseline takes approximately one minute.</p>
        <p>The Bird Vocalization model did not complete within Kaggle’s time constraints, taking nearly three
hours according to our estimates. We compile the model using TensorFlow Lite at runtime, optimizing
operations for the hardware while allowing fallback to non-lite operations. This compilation process
results in an order-of-magnitude performance increase, leaving a substantial margin for additional
computation. The linear classification head adds only an extra half-hour of computation. The BirdNET
model also runs well within time constraints as it is compiled with TensorFlow Lite.</p>
        <p>
          EnCodec exceeds the time budget, taking 2.4 hours for the base model. Experimenting with OpenVINO
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and applying data-independent quantization and compression did not improve inference speed.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Transfer Learning Experimentation</title>
        <p>Our transfer learning experiments using the Bird Vocalization classifier exhibit different behaviors
between the private and public leaderboards. While fine-tuned models outperform the base model when
trained on the subset of species provided for the competition, we hypothesize a shift in the species
distribution between the private and public test sets. The Bird Vocalization model is trained on a more
balanced dataset drawn from a larger set of species, whereas our transfer learning techniques rely on
pseudo-labeling from the donor model, which may not be well-calibrated for this task. We did not
account for the skew in the training data, apparent from the distribution of audio of each species.</p>
        <p>We address label skew through different loss function choices. We use a secondary metric during
training to provide another axis to compare models. When fine-tuning the Bird Vocalization classifier
to learn the outputs from the original classifier head, the AUROC metric converges close to unity across
various architectures. However, different losses exhibit varying learning behaviors against the
F1 score, with some designed to be better surrogates than binary cross-entropy loss. During transfer
learning, these losses provide a smooth, monotonic increase in the validation F1 score, indicating that
Bird Vocalization embeddings offer a "good" representation of domain-specific data for the multi-label
problem. We observe different behaviors in other embedding spaces, supported by our clustering charts.</p>
        <p>To address skew in the training dataset, the organizers provide unlabeled soundscapes representative
of the hidden test set. We discuss the distributional shift between species and frequent itemsets in
Section 4. Figure 5 shows the active intervals of calls, revealing differences in data geometry. The training
dataset at the bottom has tightly clustered logits, likely representing peaks in species probability
distributions. The embeddings form a large central cluster with several outliers, probably representing
distinctive calls. Conversely, the soundscape logit space forms two major clusters, reflecting the smaller
set of species present. Thus, soundscape embeddings should closely reflect clusters of birdcalls. It
would be interesting to explore how well we can discriminate between recording sites, as location likely
correlates with species distribution and co-occurrence patterns.</p>
        <p>We expect soundscapes to better represent the species distribution in the hidden test set. However,
our results show that the models trained on the soundscapes perform worse than those trained on the
original dataset. Although the addition of soundscapes adds an interesting dimension to the competition,
it requires more than cursory experimentation to incorporate into modeling effectively.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Self-Supervised Neural Codecs</title>
        <p>
          We find that EnCodec does not transfer well with similar experiments involving the linear and two-layer
classifiers, achieving validation F1-scores below 0.1. Adding an LSTM layer to handle the sequential
nature of EnCodec embeddings did not improve the scores. A much deeper model, similar to the
EnCodec decoder [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], is likely needed to learn from the quantized embeddings, but this is not feasible
within the competition’s inference time constraints.
        </p>
        <p>Additionally, EnCodec is computationally expensive and difficult to adapt to the constrained
submission environment. The Python profiler identified model inference as the bottleneck, with most time
spent on EnCodec inference. OpenVINO post-training optimizations for quantizing and compressing
weights do not significantly improve inference throughput, likely due to existing optimizations in the
upstream library. A 1.5× speedup is needed to use EnCodec in our pipeline, indicating that further
optimizations are required to leverage neural codecs based on large datasets trained with attention and
self-supervision.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Future Work</title>
      <p>Exploiting co-occurrence species information as a prior to the learning process could be beneficial.
We have demonstrated frequent pattern mining to obtain co-occurrence distributions and quantify
differences from the training dataset. Confident relationships extracted from the data can be visualized,
as shown in Figure 6, and used to reshape the probability distribution of an existing classifier to better
represent the posterior of the unlabeled soundscape.</p>
      <p>
        We aim to explore alternative parameterizations of sequential models that are computationally viable
for future competitions. The competition’s trade-offs favor compact domain-specific models over
large neural networks, focusing on linearithmic algorithms like the Fast Fourier Transform for input
representation. Finding a pre-trained neural audio codec with fewer parameters that fits within our
computational budget and passes human perceptual tests could be viable. Alternatively, training models
from scratch using different architectures via distillation methods, compatible with the encoder-decoder
architecture used in EnCodec, could be explored. State-space models like Mamba [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] provide an
appealing alternative to attention-based methods, potentially staying within our computational budget.
      </p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>Our study demonstrates the effectiveness of transfer learning in birdcall classification using embeddings
from pre-trained models like Google’s Bird Vocalization Classification Model and BirdNET. These
embeddings capture meaningful structures that are beneficial for multi-label classification, although
they do not outperform many top models in the competition. Our best-performing model, which uses
BirdNET embeddings and Bird Vocalization pseudo-labels to train a linear classifier, achieved a 0.63
score on the post-competition public leaderboard. Future work will focus on optimizing computational
efficiency and exploring alternative model architectures to better handle the sequential nature of audio
data. We also plan to incorporate species co-occurrence patterns to further enhance classification
accuracy. Our code is available at github.com/dsgt-kaggle-clef/birdclef-2024.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>Thank you to the Data Science at Georgia Tech (DS@GT) club for providing hardware for experiments,
and to the organizers of BirdCLEF and LifeCLEF for hosting the competition.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Navine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of BirdCLEF 2022: Endangered bird species recognition in soundscape recordings</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1929</fpage>
          -
          <lpage>1939</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Durak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Arikan</surname>
          </string-name>
          ,
          <article-title>Short-time Fourier transform: two fundamental properties and an optimal implementation</article-title>
          ,
          <source>IEEE Transactions on Signal Processing</source>
          <volume>51</volume>
          (
          <year>2003</year>
          )
          <fpage>1231</fpage>
          -
          <lpage>1242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eibl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>BirdNET: A deep learning solution for avian diversity monitoring</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>61</volume>
          (
          <year>2021</year>
          )
          <fpage>101236</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wisdom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>Improving bird classification with unsupervised sound separation</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>636</fpage>
          -
          <lpage>640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Global birdsong embeddings enable superior transfer learning for bioacoustic classification</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>22876</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shaposhnik</surname>
          </string-name>
          ,
          <article-title>Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>73</lpage>
          . URL: http://jmlr.org/papers/v22/20-1061.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Défossez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Copet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Adi</surname>
          </string-name>
          ,
          <article-title>High fidelity neural audio compression</article-title>
          (
          <year>2022</year>
          ). arXiv:2210.13438.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Demkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <source>BirdCLEF 2024</source>
          ,
          <year>2024</year>
          . URL: https://kaggle.com/competitions/birdclef-2024.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Mining frequent patterns without candidate generation</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          <volume>29</volume>
          (
          <year>2000</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ridnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ben-Baruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Protter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zelnik-Manor</surname>
          </string-name>
          ,
          <article-title>Asymmetric loss for multi-label classification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bénédict</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Koops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <article-title>SigmoidF1: A smooth F1 score surrogate loss for multilabel classification</article-title>
          ,
          <source>arXiv preprint arXiv:2108.10566</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gorbachev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fedorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Slavutin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tugarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fatekhov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tarkan</surname>
          </string-name>
          ,
          <article-title>OpenVINO Deep Learning Workbench: Comprehensive analysis and tuning of neural networks inference</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <article-title>Mamba: Linear-time sequence modeling with selective state spaces</article-title>
          ,
          <source>arXiv preprint arXiv:2312.00752</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>