<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distilling Spectrograms into Tokens: Fast and Lightweight Bioacoustic Classification for BirdCLEF+ 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <email>acmiyaguchi@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Murilo Gustineli</string-name>
          <email>murilogustineli@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Cheung</string-name>
          <email>acheung@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The BirdCLEF+ 2025 challenge requires classifying 206 species, including birds, mammals, insects, and amphibians, from soundscape recordings under a strict 90-minute CPU-only inference deadline, making many state-of-the-art deep learning approaches impractical. To address this constraint, the DS@GT BirdCLEF team explored two strategies. First, we establish competitive baselines by optimizing pre-trained models from the Bioacoustics Model Zoo for CPU inference. Using TFLite, we achieved a nearly 10x inference speedup for the Perch model, enabling it to run in approximately 16 minutes and achieve a final ROC-AUC score of 0.729 on the public leaderboard post-competition and 0.711 on the private leaderboard. The best model from the zoo was BirdSetEfficientNetB1, with a public score of 0.810 and a private score of 0.778. Second, we introduce a novel, lightweight pipeline named Spectrogram Token Skip-Gram (STSG) that treats bioacoustics as a sequence modeling task. This method converts audio into discrete "spectrogram tokens" by clustering Mel-spectrograms using Faiss K-means and then learns high-quality contextual embeddings for these tokens in an unsupervised manner with a Word2Vec skip-gram model. For classification, embeddings within a 5-second window are averaged and passed to a linear model. With a projected inference time of 6 minutes for a 700-minute test set, the STSG approach achieved a final ROC-AUC public score of 0.559 and a private score of 0.520, demonstrating the viability of fast tokenization approaches with static embeddings for bioacoustic classification. Supporting code for this paper can be found at https://github.com/dsgt-arc/birdclef-2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Bioacoustics</kwd>
        <kwd>Spectrogram Tokenization</kwd>
        <kwd>Self-Supervised Learning</kwd>
        <kwd>Efficient Inference</kwd>
        <kwd>Acoustic Monitoring</kwd>
        <kwd>BirdCLEF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The BirdCLEF+ 2025 challenge in the LifeCLEF Lab involves classifying fauna in the Middle Magdalena
Valley of Colombia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Participants are provided several thousand one-minute soundscapes and
must predict, for each non-overlapping 5-second interval, the probability that each of 206 species is
present. While previous iterations of the BirdCLEF challenge focused on avian calls, this year's edition
also includes mammals, insects, and amphibians.
      </p>
      <p>One important constraint is that solutions must fit within a 90-minute inference deadline on
a CPU-only instance provided on Kaggle. This computational constraint discourages solutions that rely
on brute-force computation to reach the top of the leaderboard through mechanisms like model ensembling.</p>
      <p>
        We address the BirdCLEF+ challenge in two parts. First, we provide baseline transfer-learning
solutions by training a classification head on pre-trained birdcall classification models. We hypothesize
that the representation space of existing bioacoustic models will transfer well to the domain-shifted
dataset. Existing research has shown that birdcall classifiers, such as BirdNET and Perch, handle domain
shifts well [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and even transfer to novel environments, including marine environments, with SurfPerch
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We optimize models for CPU in order to meet competition inference timeouts.
      </p>
      <p>We then experiment with a technique that utilizes discrete audio tokens for soundscape classification,
which we refer to as Spectrogram Token Skip-Gram (STSG). First, we convert audio into discrete tokens
derived from Mel-spectrograms. We then contextualize tokens into continuous space via skip-gram
embeddings and average the token embeddings within each prediction interval to build a classification
head, offering some reasonable defaults for the hyperparameters involved. Finally, we experiment with
pretraining a classification model on a surrogate task using a student-teacher model on unlabeled training
soundscapes, leveraging predictions from a strong bioacoustical model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The dominant strategy for bioacoustic classification, particularly in BirdCLEF, involves treating audio as
a computer vision problem and applying CNN ensembles to Mel-spectrograms, 2D time-frequency
representations of audio signals [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Solutions often utilize transfer learning from EfficientNet, ConvNeXt,
and similar backbones, as well as specialized bioacoustic models such as BirdNET and Perch [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
However, the increasing computational cost of these large ensembles poses a challenge in the constrained
competition setting. This has made inference optimization a critical component of state-of-the-art
methods, particularly in model compilation using TFLite, ONNX, or OpenVINO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In the broader field of audio processing, many newer models represent audio as compact sequences of
discrete tokens. This allows the use of powerful sequence modeling primarily used in the text domain,
including LLMs. These tokens can be divided into two types: acoustic tokens and semantic tokens.
Neural audio codecs, such as SoundStream [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and EnCodec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], produce acoustic tokens optimized for
reconstructing the original waveform, which is often expensive for discriminative tasks. In contrast,
models that produce semantic tokens capture abstract information but require large self-supervised
architectures like wav2vec [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While most audio tokenization efforts focus on speech data, this work
explores tokenization for resource-constrained scenarios in other audio domains.
      </p>
      <p>
        The word2vec skip-gram model is a simple method for learning dense vector embeddings from
large text corpora [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It predicts surrounding context words given a single target word, learning a
low-dimensional vector for each word that captures semantic relationships. Our STSG method adapts
this principle to audio sequences, embedding spectrogram tokens that frequently co-occur in time
close to each other in the embedding space. This makes downstream classification more efficient, in
contrast to more complex models like AudioLM, which use expensive transformers to model sequences
of tokens [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We split our methods into two main parts. We explore transfer learning in depth for the BirdCLEF+ 2025
dataset using a variety of pre-trained soundscape backbones. Here, we establish baseline classification
and computational performance characteristics on state-of-the-art models and determine the engineering
work required to deploy them to resource-constrained systems. We then explore our central hypothesis
that a tokenized representation of spectrograms can be run with marginal classification degradation
while reducing inference computation by an order of magnitude.</p>
      <sec id="sec-3-1">
        <title>3.1. Transfer Learning with Bioacoustics Model Zoo</title>
        <p>
          We build a baseline for the competition by utilizing pre-trained backbones for bioacoustic fauna
classification. The process of building a new classification head transfers the knowledge of the backbone
to a new task, effectively reusing the representation space learned from the original task. A domain-specific
classifier for birds, such as Perch, can tightly cluster sounds that belong to specific classes
like those in Figure 2. These properties make pre-trained backbones desirable for similar domains.
We source our backbones from the Bioacoustic Model Zoo, a companion to the
OpenSoundscape project [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We enumerate the available models in Table 1.
        </p>
        <p>A large variety of models are available from the Bioacoustics Model Zoo, but they cannot be run
directly in the competition for various reasons. First, additional parameters and nodes in the computation
graph needed for training can be pruned, fused, and optimized for inference. We achieve an order of
magnitude speedup for Perch when using TFLite over TensorFlow, enabling the model to run within
competition time limits.</p>
        <p>The other source of contention is the impedance mismatch between the clip length of models and the
clip length expected by the competition. The BirdNET model exemplifies this issue since it is lightweight
enough to deploy as-is but requires adaptation for the competition. In cases like this, we apply a sliding
window over the 5-second target frame and extract embeddings from each window. These embeddings
can be aggregated by directly averaging them and applying a classification head or by running the
classifier on each window individually and averaging the resulting probabilities.</p>
        <p>To simplify model training, we first train the classifier on the original window embeddings. During
inference, we average the window embeddings over the target frame and run the classifier on the
average embedding. A window is within the target frame if it overlaps with it by at least 50% of the
window size, ensuring that each window contributes to only one target frame. We plan to explore other
aggregation methods in the future, such as max pooling; however, averaging works sufficiently well as
a baseline.</p>
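<p>The window-to-frame assignment and averaging described above can be sketched in numpy. This is a minimal illustration, not the competition code; the helper names are our own, and note that a window overlaps a frame by at least half the window size exactly when its center falls inside that frame, which is what makes the single-frame assignment work.</p>

```python
import numpy as np

def assign_windows_to_frames(window_starts, window_size, frame_size=5.0):
    """Assign each sliding window to the single 5-second target frame it
    overlaps by at least 50% of the window size (all times in seconds)."""
    centers = np.asarray(window_starts) + window_size / 2.0
    # A window overlaps a frame by >= 50% exactly when its center lies in it,
    # so each window contributes to exactly one target frame.
    return np.floor(centers / frame_size).astype(int)

def average_embeddings_per_frame(embeddings, frame_ids, n_frames):
    """Average window embeddings grouped by their assigned target frame."""
    out = np.zeros((n_frames, embeddings.shape[1]))
    for f in range(n_frames):
        mask = frame_ids == f
        if mask.any():
            out[f] = embeddings[mask].mean(axis=0)
    return out

# Example: 3-second windows with a 1-second hop (BirdNET-like, an assumption).
starts = np.arange(10.0)
ids = assign_windows_to_frames(starts, 3.0)
```

The averaged per-frame embedding is then fed to the classification head trained on the original window embeddings.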
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Spectrogram Token Skip-Gram Embeddings for Classification</title>
        <p>The Spectrogram Token Skip-Gram (STSG) pipeline requires computationally efficient
transformations that map raw waveforms into discrete token space and tokens into continuous
embedding space. One advantage of this representation is that we retain the sequential nature of audio,
allowing us to contextualize a discrete token of audio with the surrounding tokens in time. We approach
this problem in three parts, as shown in Figure 3, specifically through tokenization, embedding, and
model pre-training.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Spectrogram Tokenization via Faiss on MFCCs</title>
          <p>
            We choose a simple tokenizing scheme and sequence model to maximize inference throughput. Our
tokenizer clusters Mel-spectrograms across the training soundscape using K-means to generate a
codebook for discrete representation [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>We set the spectrogram parameters in Table 2a in such a way that we obtain eight spectrogram
frames per second; in other words, each resulting token will represent 0.125 seconds of context from
the original spectrogram. This value is chosen as a power of two for efficiency and is selected to yield a
total sequence length of 40, corresponding to a 5-second prediction frame. We can increase the tokens
per second to increase our target sequence length but at the expense of losing temporal relevance at the
spectrogram level. We use a total of 768 Mel-frequency bands for testing. This number is less than 10%
of the window size, which helps reduce the number of zero-valued frequency bands.</p>
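<p>The frame-rate arithmetic above can be sketched as follows. The 32 kHz sample rate is an illustrative assumption (it is not stated in this section); the eight frames per second and 5-second prediction frames come from the text.</p>

```python
def spectrogram_params(sample_rate=32_000, frames_per_second=8, frame_seconds=5):
    """Derive the STFT hop length and per-frame token count from the target
    rate of eight spectrogram frames (tokens) per second."""
    hop_length = sample_rate // frames_per_second         # samples between frames
    token_duration = 1.0 / frames_per_second              # seconds of audio per token
    tokens_per_frame = frames_per_second * frame_seconds  # sequence length per prediction frame
    return hop_length, token_duration, tokens_per_frame

hop, dur, n = spectrogram_params()  # hop=4000 samples, 0.125 s/token, 40 tokens/frame
```

Doubling the frames per second would double the sequence length while halving the temporal context each token carries.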
          <p>
            Once we choose parameters for the Mel-spectrograms, we compute them over the entire
training soundscape dataset. We perform principal component analysis (PCA) on the normalized
Mel-spectrogram vectors to reduce their dimensionality for K-means and retrieval. Normalizing the
spectrogram vector helps reduce the effects of decibel intensities and instead focuses on the distribution of the
audio spectra. PCA will project the spectrogram frames into a low-rank space, which helps clustering
by reducing high-frequency noise and the overall computational and memory requirements. We run
K-means via Faiss over the frames to quantize them into integer tokens [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
          </p>
          <p>To obtain a new token, we generate the spectrogram frame for a given time interval, normalize it,
and project it into a low-rank space. We then find the closest K-means vector that corresponds to it
using the L2 distance. The distance can be efficiently estimated through approximate nearest neighbor
algorithms, such as those implemented in Faiss, including hierarchical navigable small worlds (HNSW).</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Skip-Gram Negative-Sampling Embeddings via Word2Vec</title>
          <p>Once we have discrete tokens from the audio, we learn a model that embeds the token into a continuous
space, which we can use for downstream classification tasks. Our soundscape data lacks proper labels
that we can utilize, but we can still learn about the relationship between tokens because audio is
naturally sequential. We achieve this in an unsupervised manner using a skip-gram negative-sampling
embedding model, where we learn the location of a token in a high-dimensional space that captures
temporal and semantic relevance within a sequential window.</p>
          <p>
            We use the Word2Vec implementation in gensim to obtain a static lookup table for a learned skip-gram
embedding model [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. We explore the initial parameter space in Table 2b. The objective is to learn
a vector representation for a target token, \(\vec{t}\), that is predictive of its actual context tokens, \(\vec{c}\), while
being dissimilar to \(k\) negative samples, \(\vec{n}_i\), drawn from the corpus. We minimize the loss function for
each positive pair [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]:
\[ \mathcal{L}_{\mathrm{sgns}} = -\log \sigma(\vec{c} \cdot \vec{t}\,) - \sum_{i=1}^{k} \log \sigma(-\vec{n}_i \cdot \vec{t}\,) \]
          </p>
          <p>The negative samples are drawn from a unigram distribution based on the token frequency \(f(w)\), smoothed
by the hyperparameter \(\alpha\) (the ns_exponent) and normalized over the corpus vocabulary:
\[ P_\alpha(w) = \frac{f(w)^\alpha}{\sum_{w'} f(w')^\alpha} \]</p>
          <p>
The value of \(\alpha\) is a critical, task-dependent hyperparameter; \(\alpha = 0.75\) is standard for NLP tasks. A
higher exponent exaggerates the sampling of frequent noise tokens, forcing the model to become highly
discriminative between rare signal tokens and common background sounds. In contrast, a negative \(\alpha\)
would make the model focus on distinguishing rare signals from each other [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
          </p>
          <p>We train a classification head on the resulting embeddings. An embedding-centric workflow ensures
that discrete token-based methods are comparable against vision methods over spectrograms.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. STSG Student-Teacher Pretraining</title>
          <p>In addition to learning a continuous representation of the spectrogram tokens, we experimented with
student-teacher modeling to see if we could impart the bioacoustical domain knowledge from Perch
into a smaller student model based on the STSG embedding space. We configure the student with a
1D-CNN to aggregate STSG embeddings, project these into a latent space, and then attempt to match
the probability distribution of Perch using KL-divergence as the loss. We apply temperature scaling to
the outputs, softening the teacher probabilities with temperature \(\tau = 3\) and scaling the resulting loss by \(\tau^2\).</p>
          <p>We use the student model from Table 3 and train it against 80% of the training soundscape, using the
remaining 20% as validation. Then, we can pre-compute embeddings for the training species dataset,
just as we do for other models in the bioacoustics model zoo. In our modeling, we rely on complete
sequences, i.e., masking is not implemented.</p>
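<p>The temperature-scaled distillation objective can be sketched in numpy. This is an illustration of the standard KL-based distillation loss under our stated \(\tau = 3\), not the actual 1D-CNN student; the function names are our own.</p>

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, tau=3.0):
    """KL(teacher || student) between temperature-softened distributions,
    scaled by tau**2 to keep gradient magnitudes comparable."""
    p = softmax(teacher_logits, tau)        # softened teacher probabilities
    log_q = np.log(softmax(student_logits, tau))
    kl = (p * (np.log(p) - log_q)).sum(axis=-1)
    return (tau ** 2) * kl.mean()
```

The loss is zero when the student matches the teacher exactly and grows as their softened distributions diverge.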
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Validation via Surrogate Classification Task</title>
        <p>To validate our transfer learning and token embedding models, we require a task that reflects the
characteristics of the competition task. We rely on the training species dataset to validate our
models. The surrogate task serves as a proxy for multi-label inference on soundscapes, for which we
lack labeled data. The training data skews toward commonly appearing species whose recordings are
easier for individuals to crowd-source. We pre-compute vector representations of
the audio aligned to the step size of the original classifier. We drop species that have fewer than two
samples, then stratify the dataset into train and validation splits at an 80/20 ratio. We train a classification head
on the embeddings with a hidden layer, non-linearity, and then cross-entropy on a linear projection to
the output space. We document the architecture of the classification head in Table 4.</p>
        <p>We select a smaller semi-representative set of species to facilitate hyperparameter tuning sweeps over
the STSG embeddings. In Table 5, we use Gemini 2.5 Pro (gemini-2.5-pro-preview-05-06 with Gemini
App system prompting) to select species using provided taxonomy information, with explicit prompting
about the region and competition to ensure mammals, insects, and amphibians have representation.
We validate the number of intervals processed by Perch and note the distribution across our small
validation set.</p>
        <p>The competition metric is a macro-averaged ROC-AUC that skips classes with no true positive labels.
We use the macro-averaged ROC-AUC as a rough proxy for our experiments. This measure often fails
to provide precise measurements, as it frequently remains near 1.0 in transfer learning experiments.
We will also report the F1-macro score, which enables more precise observations between models with
different hyperparameters.</p>
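<p>Both metrics are available in scikit-learn. The sketch below uses a toy single-label multiclass example for clarity (the competition task itself is multi-label); the data values are illustrative only.</p>

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy validation data: true labels and predicted class probabilities for 3 classes.
y_true = np.array([0, 1, 2, 1, 0, 2])
scores = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Macro-averaged one-vs-rest ROC-AUC: the rough proxy for the competition metric.
auc = roc_auc_score(y_true, scores, multi_class="ovr", average="macro")
# F1-macro on hard predictions: finer-grained separation between models.
f1 = f1_score(y_true, scores.argmax(axis=1), average="macro")
```

In our transfer learning experiments the ROC-AUC saturates near 1.0, which is why we lean on F1-macro for comparisons.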
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. STSG Hyperparameter Tuning</title>
        <p>The STSG training process has a large number of parameters that can affect the output results. For the
sake of time, we select a set of hyperparameters that suffice for validating that the pipeline works in
both training and inference and that it yields improvements over baseline methods. Refer to Table 2a
for the set of default parameters.</p>
        <p>Our first attempt at modeling used Mel-Frequency Cepstral Coefficients (MFCCs), which are
effectively the discrete cosine transform (DCT) of the Mel-spectrogram. We denote this STSG v1, with
hyperparameter plots in Appendix A. However, we artificially limit the representational power of the
tokens in v1 by being overly conservative with the parameters of 128 Mel bands and 20 MFCCs. In
Table 6, we describe the parameters for the v2 line of models that use the Mel-spectrogram
representation of 768 Mel-bands. The remaining hyperparameter results that follow utilize the Mel-spectrogram
representation.</p>
        <p>Given a spectrogram representation, we must determine how many centroids we compute over
the entire soundscape dataset. The centroids act as a data-driven vocabulary for the audio corpus. In
Figure 4, we sweep over the parameters of the vocabulary. We settle on a token set size of 16k, which
both fits inside a 2-byte integer and provides a reasonable amount of support, given that there are 3.73
million tokens in the 80% training split of the soundscape. While increasing the number of tokens to
32k improves the validation F1-macro scores, it also increases the overall required training time for the
embedding proportionally.</p>
        <p>Applying PCA to the tokenizer is also a net positive to the model. We choose 128 dimensions,
as shown in Table 7, which retains 87% of the original variance of the 768 Mel-spectrogram bands.
Reducing the number of dimensions increases modeling performance, likely due to the omission of
high-frequency noise and latent modeling of sparse substructures in the original data. Additionally,
many of the algorithms that we use (K-means and approximate K-NN) have algorithmic complexity
that is linearly proportional to the number of dimensions. Reducing the number of dimensions allows
us to fit more rows into memory and execute fewer instructions overall when tokenizing the signal.</p>
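<p>The variance-retention check behind the choice of 128 PCA dimensions can be sketched in numpy via the singular values of the centered data. The data here is synthetic low-rank noise, not our spectrogram frames.</p>

```python
import numpy as np

def variance_retained(X, n_components):
    """Fraction of total variance captured by the top principal components,
    computed from the singular values of the mean-centered data."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var[:n_components].sum() / var.sum()
```

Sweeping <monospace>n_components</monospace> and reading off the retained variance is how one would pick the smallest dimensionality that keeps most of the signal.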
        <p>Once we have determined our token size, we conduct hyperparameter sweeps over the Gensim
Word2Vec model, as shown in Table 8. We vary over vector size, window size, negative sampling
exponent, and sample rate. The vector size determines the capacity of the model; we sweep
over 128, 256, 384, 512, and 1028 dimensions. The dimension size is the most important parameter, as it
directly affects the model’s ability to learn complex relationships between tokens. Increasing the vector
size by a factor of three increases the F1-macro score by 0.018 but also increases the training time by a
factor of 1.7. We find that lowering the window size from the default of 80 decreases the score by 0.013,
whereas increasing the window size does not significantly improve the score. However, the training
time is significantly affected by the window size, with a smaller window size reducing the training time
by 112 seconds.</p>
        <p>We find that sweeping over the negative sampling exponent \(\alpha\) does not yield a clear pattern; however,
in this particular sweep, setting \(\alpha\) to 0.0 yields the best performance. This parameter controls the
distribution of negative samples, and a setting of 0.0 indicates a uniform distribution over the vocabulary.
A positive exponent means negative samples are more likely to be drawn from the more frequent tokens,
while a negative exponent means negative samples are more likely to be drawn from the less frequent
tokens. Finally, we sweep over the subsampling rate, which controls the frequency of token sampling
during training. We find that decreasing the sample rate to 1e-4 does not change the score but increases
the training time by a factor of two. Increasing the sample rate to 1e-6 results in a significant drop in
performance, as only the most frequent tokens are sampled, and the model is unable to learn from the
less frequent ones. These results help motivate the final choice of hyperparameters, which are balanced
to provide a good trade-off between performance and training time.</p>
        <p>Figure 5: (a) F1 score vs. ns_exponent; (b) Word2Vec validation curves.</p>
        <p>Given the hyperparameter sweeps, we then fix the remaining parameters and vary the negative
sampling exponent \(\alpha\). In our parameter scans, it was unclear how the negative sampling exponent
and subsampling rates interact with each other. However, we find that when we set the subsampling
threshold to 1e-5, the best \(\alpha\) is 0.0. When we increase the number of samples by setting the sampling threshold to
1e-4, the best \(\alpha\) decreases to -0.75. When we aggressively subsample tokens in the former setting, we
remove many of the most frequent tokens, which likely correspond to silence or background noise,
making the resulting token distribution flatter. The uniform strategy is effective since the noise is
already downsampled. When we are less aggressive with the samples, the distribution more closely
resembles the original distribution. Setting this to a negative value then forces the model to learn how
to distinguish between tokens corresponding to different species rather than focusing on the frequent
tokens.</p>
        <p>We also look at the performance over time in Figure 5b and find that 100 epochs are a reasonable
length of time to run the Word2Vec algorithm. In the first two iterations (v2.0 and v2.1), where we set
subsampling to 1e-5, we observe either overfitting or significant stochasticity in the validation curve
over time. We increase the number of tokens available for training in v2.2, and this stabilizes training
to a large degree.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. STSG Student Model Results</title>
        <p>We report the results of the STSG student model in Table 9 for the validation task, comparing it to the
Perch teacher model and our best token-based model. These initial results show that the teacher model
is competent, achieving a macro F1 score of 0.80 and a micro F1 score of 0.91. The Mel-spectrogram,
used directly by averaging the features over the time step, achieves a score of 0.12. The STSG model
performs better than the spectrogram baseline but worse than the teacher, with a score of 0.56. The
token model can learn a representation space of the soundscape that can be transferred to the training
dataset to a lower degree than the Perch model but is significantly better than choosing at random.</p>
        <p>Finally, we observe that the student model, trained on the STSG representation space and then
fine-tuned on the surrogate task, achieves a macro F1 score of 0.47. The student model’s performance
represents a significant regression compared to using the STSG embeddings directly, suggesting that the
STSG embeddings are unable to capture the complex relationships needed to represent the Perch logits
effectively and that the geometry of the learned embedding space does not transfer well to classifying
the training species dataset.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Transfer Learning and Leaderboard Results</title>
        <p>We report the transfer learning surrogate task in Table 10. Here, we note that the ROC-AUC score does
not help us disambiguate between the relative performance of models. We note that the F1-macro score
correlates well with the ranked performance of the models while providing larger differences in scores
for individual observations. The metric is also built into the classification report in the scikit-learn
library. We find that our best STSG models yield results around 0.56, whereas the best computer vision
backbone models achieve results around 0.87.</p>
        <p>We report the final model performance and timing on the leaderboard in Table 11. We note that the
primary metric that we are looking at here is the modified ROC-AUC scores, which are much lower
than the ROC-AUC metrics in our surrogate task. We note that the bioacoustic model zoo performs well
for this simple transfer learning task without requiring complex modeling to address the domain shift
in the datasets. In comparison, the token embedding models achieve a modest score of around 0.56.</p>
        <p>However, the STSG model is lightweight and fast. It is three times as fast (0.5 seconds per file vs.
1.4 seconds per file) when compared to Perch on TFLite. The Word2Vec modeling is weak but can be
replaced by more robust and complex token embeddings without altering the runtime characteristics
as long as the token embeddings are static. Additionally, we found that many of the Torch-based
bioacoustical models did not require any compilation or optimization to run within the competition
bounds. A significant amount of tuning can be done despite the computational constraints, as evident
from the final leaderboard for the 2025 competition.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Transfer Learning</title>
        <p>While the transfer learning experiments we have run are methodologically simple, we were able to
test a much wider variety of models than in previous years. The bioacoustic model zoo is an excellent
library for development, as it provides a reasonable API into many popular bioacoustic backbones.
We visualize the embedding space extracted from the library in Figure 6, which provides a qualitative
view of the geometric structure of the backbones. The tight clustering behavior of BirdNET, Perch, and
BirdSet models is desirable, as it facilitates more straightforward mapping to our final objective space.
The color of the data corresponds to the one-dimensional projection of the embedding in BirdNET,
which effectively ranks data points on a number line. We expect to see a smooth transition in color
based on distance in two dimensions. The discoloration in RanaSierraeCNN reflects its leaderboard
performance, with the decrease attributable to ambiguity between similar points. HawkEars
is a model that we were unable to test on the leaderboard due to its ensembling structure; however,
the domain-specific nature of Hawks leads to a more substantial domain shift, also observed in the
Rana Sierra Frog classifier. We also note that we did not test YAMNet due to the unusual clip length,
although it may be helpful as a model for filtering.</p>
        <p>
          BirdNET and Perch continue to be staple models that provide state-of-the-art performance with
broad reach in real deployments. The BirdSet dataset has made it much easier to develop new backbone
models that rival the performance of BirdNET and Perch while staying in the Torch ecosystem and
with prediction windows corresponding to the BirdCLEF challenge [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The TensorFlow software stack
complicates experiments where the primary modeling tool is Torch. In particular, we encountered
complications while running models in the bioacoustic model zoo that relied on both TensorFlow
and Torch, which required identifying the appropriate version pins for inference. Surprisingly, the
BirdSetEfficientNetB1 model outperformed the TensorFlow models, achieving the best score on the
public leaderboard at 0.81. The BirdSet dataset and resulting models will likely form the basis of strong
solutions in future iterations of the competition due to the ease of benchmarking and tooling.
        </p>
        <p>Figure 6 panels: (a) BirdNET; (b) BirdSetConvNeXT; (c) BirdSetEfficientNetB1; (d) HawkEars; (e) Perch; (f) RanaSierraeCNN.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Spectrogram Tokens and Sequence Modeling</title>
        <p>Our experiments tested whether bioacoustic data can be modeled as discrete sequences and whether
such sequences could serve as a basis for student-teacher modeling. We found a fast and lightweight
token representation that enables unsupervised modeling, despite its limited discriminative power.
Spectrogram tokens are an interesting result because they rely only on the algorithmic building blocks
of clustering and nearest-neighbor search to represent the sequence sparsely. The information bottleneck
in our process lies in the signal representation given by the spectrogram: when we increased the number
of Mel bands and avoided dimensionality reduction early in the pipeline, our validation scores improved
by approximately 0.1 in F1-macro. A better representation leads to better tokenization, which in turn
enables the sequence model to represent the audio more faithfully for classification.</p>
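        <p>The tokenize-lookup-classify path described above can be sketched in a few lines of NumPy. The centroid matrix, embedding table, and linear head below are random stand-ins for the Faiss K-means codebook, the Word2Vec skip-gram vectors, and the trained classifier; all names and shapes are illustrative, not the actual pipeline code.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: a K-means codebook over spectrogram frames,
# a static skip-gram embedding table, and a linear classifier head.
n_mels, k, emb_dim, n_species = 128, 512, 64, 206
centroids = rng.normal(size=(k, n_mels))      # K-means centroids (codebook)
embeddings = rng.normal(size=(k, emb_dim))    # one static vector per token
W, b = rng.normal(size=(n_species, emb_dim)), np.zeros(n_species)

def tokenize(frames):
    """Assign each spectrogram frame to its nearest centroid (L2 distance)."""
    d2 = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                  # one integer token per frame

def classify_window(frames):
    """Average token embeddings over a window, then apply the linear head."""
    tokens = tokenize(frames)
    pooled = embeddings[tokens].mean(axis=0)
    return W @ pooled + b                     # per-species logits

frames = rng.normal(size=(216, n_mels))       # ~5 s of mel-spectrogram frames
logits = classify_window(frames)
print(logits.shape)                           # (206,)
```

        <p>At inference time, the only per-frame work is a nearest-neighbor lookup and a table read, which is what keeps this path fast on CPU.</p>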
        <p>
          The idea of quantizing audio for downstream applications is not unique to this work. There are
data-driven neural codecs for compressing and decompressing audio, such as EnCodec [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. However,
our team’s initial explorations into neural codecs in the 2024 competition placed EnCodec specifically
in the non-viable category due to the computational overhead of attention [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Other frameworks, such
as wav2vec 2.0, closely mirror our pipeline, but they rely on a latent audio representation from CNNs
that is quantized and then contextualized by Transformers. wav2vec 2.0 would likely not work without
modification for the BirdCLEF competition because of the imposed inference deadline: the CNN feature
extractor is likely viable, given how prevalent CNNs already are in submissions, but the Transformer
network would need an efficient approximation at inference time.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Future work will place greater emphasis on modeling complex sequences. In particular, we would like
to see better token representations and the use of Transformers to contextualize token sequences.
It would be interesting to determine how much of the wav2vec framework transfers to BirdCLEF and
whether Transformer-contextualized tokens can run on constrained systems.</p>
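      <p>As a rough illustration of what contextualization adds over a static lookup, a single self-attention layer over a token-embedding sequence fits in a few lines of NumPy. This is a toy sketch, not a proposal for an actual architecture; all shapes and weights are made up.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 216, 64                       # ~5 s of tokens, embedding width

X = rng.normal(size=(seq_len, d))          # static (Word2Vec-style) embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

# Single-head self-attention: every token is re-expressed as a weighted
# mixture of all tokens in the window, i.e. contextualized.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))          # (seq_len, seq_len) attention map
contextual = A @ V                         # contextualized token embeddings
print(contextual.shape)                    # (216, 64)
```

      <p>The quadratic (seq_len × seq_len) attention map is exactly the cost that makes naive Transformer inference difficult under the competition's CPU budget.</p>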
      <p>While the STSG representation is fast, it leaves significant room for improvement. Since a static
lookup table with a lightweight classification head consumes almost none of the inference budget, most
of it can be spent computing higher-quality tokens. The first direction is to replace the spectrogram
token with a YAMNet-based token, which, although it yields a shorter sequence of two tokens per second,
provides a more powerful representation that should form more meaningful clusters under K-means.
Another approach is to use a clustering algorithm such as HDBSCAN to identify centroids better suited
to the topology of the data, though this sacrifices the performance characteristics of a highly
optimized ANN search. The final direction is to learn tokens with wav2vec and heavily optimize that
framework for CPU inference; wav2vec is promising, although it requires significant engineering work
to execute.</p>
      <p>
        One factor that makes Transformer-based models challenging to use in the BirdCLEF+
competition is that naive inference is expensive due to the sheer number of parameters involved.
However, Transformers are a natural way to represent sequential data because attention
efficiently contextualizes tokens within sequences [
        <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Using distillation methods such as those found in Model2Vec, we may be able to statically
capture token representations learned by Transformer networks far larger than anything that can run
on the inference machine [
        <xref ref-type="bibr" rid="ref19">19</xref>
          ]. This
technique establishes a static mapping for each token by embedding it as a single-token sequence,
compressing the embeddings with PCA, and applying an information-based reweighting scheme that
assumes a Zipf distribution over tokens. The resulting token vectors can be averaged across a
sentence and have demonstrated good performance on downstream encoder-style tasks.
      </p>
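      <p>A minimal NumPy sketch of that recipe follows, with a random matrix standing in for the teacher's single-token embeddings and the simplest Zipf-style weight (a log-rank ramp over tokens assumed sorted by frequency). The actual Model2Vec implementation differs in its details; this only illustrates the shape of the computation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_teacher, d_static = 1000, 384, 64

# 1. Embed each token alone through the teacher (random stand-in here).
E = rng.normal(size=(vocab, d_teacher))

# 2. Compress with PCA: center, then project onto the top singular vectors.
E_centered = E - E.mean(axis=0)
_, _, Vt = np.linalg.svd(E_centered, full_matrices=False)
E_pca = E_centered @ Vt[:d_static].T               # (vocab, d_static)

# 3. Reweight assuming Zipf-distributed token frequencies: frequent
#    (low-rank) tokens carry less information, so down-weight them.
ranks = np.arange(1, vocab + 1)                    # tokens sorted by frequency
weights = np.log(1 + ranks) / np.log(1 + vocab)    # grows with rarity
static_table = E_pca * weights[:, None]

# A "sentence" embedding is then just the mean of its token rows.
sentence = static_table[[3, 17, 256]].mean(axis=0)
print(sentence.shape)                              # (64,)
```

      <p>The appeal for this competition is that, once distilled, inference reduces to table lookups and means, with no Transformer forward pass at all.</p>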
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>Through this work, we address computational limitations in bioacoustic classification by leveraging
transfer learning from domain-specific classifiers and a lightweight token-based approach. We demonstrate
the performance of state-of-the-art models on domain-shifted data through models like BirdNET
and Perch in the Bioacoustics Model Zoo. Our best model was BirdSetEfficientNetB1, with a private
ROC-AUC score of 0.778 and a public score of 0.810. We also investigate the foundations of distilling
complex sequence models into a static lookup of contextualized tokens. Our Spectrogram Token
Skip-Gram (STSG) pipeline achieves a runtime of 6 minutes on the test dataset, yielding a final ROC-AUC
private score of 0.520 and a public score of 0.559. The STSG pipeline provides a basis for more
complex sequence models, as it enables the discovery of a representational space that can be parsed
and transformed efficiently on a CPU. Although its overall performance is low, there are interesting
directions for this research, given that spectrogram tokens possess real representational power.</p>
      <p>Supporting code for this paper is located at https://github.com/dsgt-arc/birdclef-2025.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>
        We thank the Data Science at Georgia Tech (DS@GT) CLEF competition group for their support. This
research was supported in part through research cyberinfrastructure resources and services provided by
the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology,
Atlanta, Georgia, USA [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini Pro for abstract drafting,
formatting assistance, and grammar and spelling checks. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. STSG v1 Hyperparameter Tuning</title>
      <p>The first version of STSG uses a spectrogram with 128 Mel bands and 20 MFCCs.</p>
      <p>[Figure: change in validation F1-macro relative to the baseline when varying vector_size, window, ns_exponent, and sample, individually and in combination.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Toro-Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rodriguez-Buritica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Benavides-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Ulloa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Caycedo-Rosales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of BirdCLEF+ 2025: Multi-taxonomic sound identification in the Middle Magdalena Valley, Colombia</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2025: Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Global birdsong embeddings enable superior transfer learning for bioacoustic classification</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>22876</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Fleishman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McKown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Munger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Rice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lillis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          , et al.,
          <article-title>Using tropical reef, bird and unrelated sounds for superior transfer learning in marine bioacoustics</article-title>
          ,
          <source>Philosophical Transactions B</source>
          <volume>380</volume>
          (
          <year>2025</year>
          )
          <fpage>20240280</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miyaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gustineli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Transfer learning with pseudo multi-label birdcall classification for ds@gt birdclef 2024</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Zeghidour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luebs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Omran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Skoglund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tagliasacchi</surname>
          </string-name>
          ,
          <article-title>SoundStream: An end-to-end neural audio codec</article-title>
          ,
          <source>CoRR abs/2107.03312</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2107.03312. arXiv:2107.03312.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Défossez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Copet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Adi</surname>
          </string-name>
          ,
          <article-title>High fidelity neural audio compression</article-title>
          ,
          <source>arXiv preprint arXiv:2210.13438</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Borsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marinier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharitonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pietquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharifi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roblek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Teboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tagliasacchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zeghidour</surname>
          </string-name>
          ,
          <article-title>Audiolm: A language modeling approach to audio generation</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>31</volume>
          (
          <year>2023</year>
          )
          <fpage>2523</fpage>
          -
          <lpage>2533</lpage>
          . URL: https://doi.org/10.1109/TASLP.2023.3288409. doi:10.1109/TASLP.2023.3288409.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rhinehart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Freeland-Haynes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Khilnani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Syunkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kitzes</surname>
          </string-name>
          ,
          <article-title>Opensoundscape: an open-source bioacoustics analysis package for python</article-title>
          ,
          <source>Methods in Ecology and Evolution</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>2321</fpage>
          -
          <lpage>2328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mermelstein</surname>
          </string-name>
          ,
          <article-title>Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences</article-title>
          ,
          <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>
          <volume>28</volume>
          (
          <year>1980</year>
          )
          <fpage>357</fpage>
          -
          <lpage>366</lpage>
          . doi:10.1109/TASSP.1980.1163420.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guzhva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          , J. Johnson, G. Szilvasy, P.-E. Mazaré,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>The faiss library</article-title>
          (
          <year>2024</year>
          ). arXiv:2401.08281.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sojka</surname>
          </string-name>
          ,
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          ,
          <source>in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          ,
          <string-name>
            <surname>ELRA</surname>
          </string-name>
          , Valletta, Malta,
          <year>2010</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Caselles-Dupré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lesaint</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Royo-Letelier</surname>
          </string-name>
          ,
          <article-title>Word2vec applied to recommendation: Hyperparameters matter</article-title>
          ,
          <source>in: Proceedings of the 12th ACM Conference on Recommender Systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>352</fpage>
          -
          <lpage>356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huseljic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tomforde</surname>
          </string-name>
          , et al.,
          <article-title>Birdset: A large-scale dataset for audio classification in avian bioacoustics</article-title>
          ,
          <source>arXiv preprint arXiv:2403.10380</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tulkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <article-title>Model2vec: Fast state-of-the-art static embeddings</article-title>
          (
          <year>2024</year>
          ). URL: https://github.com/MinishLab/model2vec.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>PACE</surname>
          </string-name>
          ,
          <article-title>Partnership for an Advanced Computing Environment (PACE)</article-title>
          ,
          <year>2017</year>
          . URL: http://www.pace.gatech.edu.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>