<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tackling Domain Shift in Bird Audio Classification via Transfer Learning and Semi-Supervised Distillation: A Case Study on BirdCLEF+ 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Volodymyr Sydorskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Gonçalves</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"</institution>
        </aff>
      </contrib-group>
      <pub-date>
<year>2025</year>
      </pub-date>
      <abstract>
<p>We present our solution from team volodymyr vialactea to the BirdCLEF+ 2025 challenge, which achieved state-of-the-art performance, placing 2nd on the Private Leaderboard with a ROC AUC of 0.928 on the Private test set and 0.925 on the Public test set. Our system is based on five key components: a strong baseline model, in-domain transfer learning, semi-supervised learning implemented via model distillation to mitigate domain shift, postprocessing, and model ensembling. We conduct an ablation study to evaluate the contribution of each component and analyze the effects of different augmentations and data setups. Furthermore, we investigate the domain shift between training and test distributions and explore strategies for its mitigation. Our code is publicly available at https://github.com/VSydorskyy/BirdCLEF_2025_2nd_place.</p>
      </abstract>
      <kwd-group>
<kwd>Birdcall classification</kwd>
        <kwd>Transfer learning</kwd>
        <kwd>Semi-supervised learning</kwd>
        <kwd>Domain adaptation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>• Ensembling diverse models, and performing post-processing of model probabilities.</p>
      <p>The application of these techniques resulted in a highly robust and accurate system developed by
team volodymyr vialactea, achieving a score of 0.928 on the Private Leaderboard and 0.925 on the Public
Leaderboard—ultimately securing second place in the final Private ranking. In addition, we conduct an
ablation study to evaluate the impact of individual components in our pipeline.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews related work that underpins
our approach. Section 3 provides an overview of the BirdCLEF+ 2025 task. Section 4 details the structure
and properties of the training and test datasets. Section 5 presents the proposed approach. Section 6
reports ablation studies and experimental findings. Finally, Section 7 discusses the main limitations of
the current solution, outlines directions for future improvements, and concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The gold standard in sound classification tasks is the use of a Spectrogram → CNN → Classification head
architecture [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This pipeline is also widely adopted in birdcall classification tasks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However,
a range of modifications have been proposed. For example, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced a classification head that
combines recurrent neural networks (RNNs) with an attention mechanism. This design enables training
on weak (clip-level) labels while allowing the prediction of strong (frame-level) labels. One of the most
widely used deep learning systems for birdcall classification is BirdNET [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Beyond the standard
spectrogram-based CNN approach, BirdNET enhances its architecture by using multiple spectrogram
variants that emphasize different frequency ranges. Alternative strategies include the use of 1D CNNs
applied directly to raw audio signals [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], hybrid 1D CNN + 2D CNN methods [17], and
transformer-based models such as ECAPA-TDNN [18]. However, these alternative methods have not consistently
demonstrated superior performance in bird classification and tend to provide meaningful improvements
primarily when used as part of ensembles—at least as observed in BirdCLEF competitions. A notable
limitation of the BirdNET approach is its dependence on a fixed CNN backbone, specifically EfficientNet
[19]. As newer CNN architectures such as NFNet [20] and EfficientNetV2 [21] are developed, BirdNET
requires continuous architectural updates to remain competitive with state-of-the-art performance.
      </p>
      <p>The challenge of domain adaptation is common across many machine learning tasks, and numerous
methods have been proposed to address it [22, 23, 24]. However, domain-specific challenges often
require tailored solutions. In the BirdCLEF 2024 competition, several approaches were proposed to
address the domain shift between the focal (training) and soundscape (test) recordings. For instance,
[25] applied classical semi-supervised learning to include unlabeled test examples in the training
pipeline. Additionally, techniques such as no-call classifiers, test-time audio scaling, frequency-based
noise removal, and domain-distance-based filtering were explored. It is important to note that since
BirdCLEF 2024, unlabeled soundscape recordings have been made available to participants, enabling
the use of semi-supervised learning techniques [26, 27]. Among the proposed solutions, using binarized
pseudo-labels [25] tends to result in overconfident predictions, while hand-crafted domain adaptation
strategies often lack flexibility and fail to introduce robust adaptation mechanisms. Thus, there is clear
room for improvement in this direction.</p>
      <p>In modern deep learning, transfer learning has become a foundational component across nearly all
application domains [28]. In the audio classification domain, the situation is more nuanced. For example,
PANNs—a family of CNNs pre-trained on the large-scale AudioSet—have shown strong performance
across downstream audio tasks [29]. Interestingly, CNN encoders pre-trained on ImageNet also improve
birdcall classification performance when applied to spectrograms [30]. Furthermore, recent approaches
such as [25] utilize BirdNET and the Google Bird Vocalization Classifier [31] to extract audio embeddings
and train dataset-specific classification heads on top. While these strategies clearly enhance performance,
they still underutilize the full potential of transfer learning—especially in the context of continually
expanding birdcall databases. This suggests that more aggressive and systematic use of transfer learning
could yield further improvements.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Overview</title>
      <p>
        The BirdCLEF+ 2025 competition focused on the challenge of developing machine learning systems
to identify under-studied species in the lowlands of the Magdalena Valley, Colombia, based on their
acoustic signatures. Compared to previous editions, the competition expanded beyond bird species
(Aves) to include other animal taxa as well (see Table 1). The training data were compiled from three
major sources: Xeno-Canto [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], iNaturalist [32], and the CSA collection provided by the Instituto
Humboldt [33]. The test data consisted of 1-minute long soundscape recordings collected through
passive acoustic monitoring. All recordings were provided in Ogg format and resampled to 32 kHz.
Participants were required to predict the probability of presence for each of the 206 target classes in
every 5-second chunk of each audio file. The test set contained 700 soundscape samples, with a 34%/66%
split between the Public and Private leaderboard sets. Additionally, strict computational constraints
were imposed: all test-time predictions had to be executed within a 90-minute time limit using a Kaggle
Notebook running on CPU-only (typically an Intel® Xeon® CPU @ 2.20GHz). Submissions were
evaluated using a version of macro-averaged ROC-AUC that skips classes which have no true positive
labels (see footnote 1).
      </p>
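      <p>A minimal NumPy sketch of this skip-logic follows; the pairwise AUC formulation and helper names are our illustration, not the official Kaggle implementation:</p>

```python
import numpy as np

def roc_auc_binary(y_true, y_score):
    """ROC AUC via pairwise comparison of positive and negative scores.
    Returns NaN when the class has no positive (or no negative) labels."""
    y_true = np.asarray(y_true, dtype=bool)
    pos = np.asarray(y_score)[y_true]
    neg = np.asarray(y_score)[~y_true]
    if pos.size == 0 or neg.size == 0:
        return float("nan")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

def macro_auc_skip_empty(y_true, y_score):
    """Macro-averaged ROC AUC that skips classes with no true positive labels."""
    aucs = [roc_auc_binary(y_true[:, c], y_score[:, c])
            for c in range(y_true.shape[1])]
    valid = [a for a in aucs if not np.isnan(a)]
    return float(np.mean(valid))
```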
    </sec>
    <sec id="sec-4">
      <title>4. Data Overview</title>
      <sec id="sec-4-1">
        <title>4.1. Training Data</title>
        <p>The training set consists of 28,564 audio samples, totaling approximately 280.5 hours of audio. The
distribution across different data sources is highly imbalanced, with the vast majority of recordings
coming from Xeno-Canto (see Table 2). A similar imbalance is observed in species distribution (see
Figure 1): there are 39 species with fewer than 10 recordings—referred to as "undersampled"—and 9
species with more than 500 recordings. Overall, the imbalance can be quantified as Imbalance Ratio =
max_c N_c / min_c N_c = 495, where N_c is the sample count for class c. In addition to the primary species, some
Xeno-Canto recordings include secondary (background) species, while such labels are absent in other
sources. These secondary species appear in 4,958 recordings and can be used to model the multilabel
structure of the test data. Finally, the training data was collected by 2,772 unique recordists, with 1,612
recordings lacking any metadata about the recordist.</p>
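        <p>The Imbalance Ratio above can be computed directly from the per-class sample counts; a small sketch (helper name ours):</p>

```python
from collections import Counter

def imbalance_ratio(primary_labels):
    """Imbalance Ratio = max_c N_c / min_c N_c, where N_c is the
    number of samples for class c."""
    counts = Counter(primary_labels)
    return max(counts.values()) / min(counts.values())
```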
        <sec id="sec-4-1-1">
          <title>1https://www.kaggle.com/code/metric/birdclef-roc-auc</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Soundscapes</title>
        <p>At the time of writing, the test set is not publicly available, preventing direct exploratory data analysis
of test soundscapes. However, assuming that the test soundscapes follow the same (or at least a similar)
distribution as the available unlabeled soundscapes, we approximate our analysis using pseudo labels
(see Section 5.4). It is important to note that the following analysis may be affected by classification
model bias, primarily inherited from the training data. For the analysis, we utilize pseudo labels from the
first pseudo-iteration generated in non-OOF mode. In total, 9,726 unlabeled soundscape recordings are
available, which correspond to 116,712 audio chunks of 5 seconds each. A primary point of interest is the
analysis of domain shift between the training data and the soundscapes. We begin with a comparison
of species distributions in the training data and the soundscapes. For this purpose, we count a species
as present in a chunk if its predicted probability exceeds 0.5. As shown in Table 9, certain species are
highly underrepresented in the training data but frequently appear in the soundscape recordings, and
vice versa. Overall, the Pearson correlation between the number of training and soundscape occurrences
per species is 0.0958, and the Spearman correlation is 0.2855. Another manifestation of domain shift
lies in the number of species present in a single recording. In the training data, over 90% of samples
contain only a single species, and the maximum number of species per recording is 12 (observed in only
2 recordings). In contrast, soundscape files demonstrate a much broader distribution: approximately
54% of recordings contain no identifiable species, 11% contain one species, 7% contain two, and some
contain up to 25 species (see Table 10). Notably, soundscape files also include "nocall" samples. Another
significant contributor to domain shift is the variation in recording conditions and devices. These
differences are clearly illustrated by the spectrogram comparison shown in Figure 6. Finally, perhaps
the most impactful aspect of domain shift lies in the annotation procedure: training data is annotated in
a weak manner—labels apply to the entire recording, which may span several minutes, even though the
bird vocalization may occur only for several seconds. In contrast, soundscape data is strongly labeled,
providing species presence annotations at a fine temporal resolution of every 5-second chunk.</p>
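        <p>The presence counting and correlation analysis above can be reproduced with a few lines of NumPy; the rank-based Spearman sketch below ignores ties, and all function names are ours:</p>

```python
import numpy as np

def presence_counts(chunk_probs, threshold=0.5):
    """Count, per species, the chunks whose predicted probability exceeds threshold."""
    return (np.asarray(chunk_probs) > threshold).sum(axis=0)

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman(x, y):
    """Spearman correlation as Pearson on ranks (no tie correction)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))
```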
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Additional Training Data</title>
        <p>Using the open APIs of Xeno-Canto, iNaturalist, and CSA, we collected additional training samples for
species of interest. We downloaded only the samples that comply with the competition’s data license (see footnotes 2–4).
The full distribution of these additional samples is shown in Table 3. Interestingly, the most significant
performance improvements came from incorporating the Xeno-Canto dump from 28.03.2025 and a small
set of New data samples from previous competitions. Including all additional datasets reduced the number
of undersampled classes to 28, while using only the two most impactful sources reduced it to 38.</p>
        <sec id="sec-4-3-1">
          <title>2https://www.kaggle.com/competitions/birdclef-2025/rules#7.-data-access-and-use 3https://www.kaggle.com/competitions/birdclef-2025/discussion/570760#3174229 4https://www.kaggle.com/competitions/birdclef-2025/rules#6.-external-data-and-tools</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Method</title>
      <sec id="sec-5-1">
        <title>5.1. Validation</title>
        <p>Building reliable validation remained one of the most significant challenges in the BirdCLEF series,
primarily due to the domain shift described in Section 4.2. A baseline strategy was implemented
using 5-fold cross-validation, stratified by primary species and grouped by author. This configuration
helped maintain class distribution across folds while reducing potential data leakage, particularly from
recordists contributing multiple sequential recordings. However, this baseline setup struggled with
undersampled species, which were often represented by only 1–2 samples per fold or omitted entirely.
To address this, the baseline was extended with one of the following modifications:
1. Adding undersampled species to validation folds – Undersampled species were placed in
the validation set by ensuring at least one instance from each class, with corresponding samples
removed from training. This allowed performance assessment on all classes, but further decreased
the training data for rare species.
2. Adding all undersampled species to train folds – All available samples from undersampled
species were included solely in the training set and excluded from validation. This eliminated
evaluation of these classes but likely improved their recognition performance by enlarging the
training data and minimizing label noise.</p>
        <p>Despite these strategies, the correlation between local validation and Public Leaderboard scores
remained relatively low. In general, large improvements in both validation and test performance were
observed during early experimentation. However, once the Public ROC AUC score approached 0.9,
the correlation between validation metrics and test results weakened considerably. Two metrics were
computed: Mean ROC AUC (averaged across all folds) and Out-of-Fold (OOF) ROC AUC (evaluated by
concatenating all validation predictions). The Pearson and Spearman correlations between Mean ROC
AUC and Public scores were 0.8643 and 0.5438, respectively. In contrast, within the high-performing
region where Public scores exceeded 0.9, correlations dropped sharply to −0.1291 (Pearson) and
−0.1200 (Spearman). The corresponding scatter plots are shown in Figure 2.</p>
        <p>Additionally, we experimented with mixing multiple recordings from the training set or even from
soundscapes during the validation stage in an attempt to improve the consistency of validation scores.
However, this approach did not increase the correlation between local validation and Public Leaderboard,
and also introduced additional randomness due to varying sample combinations.</p>
        <p>It is noteworthy that, as shown in Table 5, Public scores consistently exceeded 0.9 once pseudo-labeling
(see Section 5.4) was applied—serving primarily as a domain adaptation strategy. This observation
suggests that validating directly on soundscape files may be a more appropriate approach. Finally,
validation remained reasonably consistent during the final ensembling stage (see Section 6.2).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Baseline</title>
        <p>The baseline approach was largely inspired by the BirdCLEF 2023 1st place solution (see footnote 5). The main
distinction lies in replacing the ConvNeXt [34] model family with EfficientNetV2 [21]. Our approach
primarily relied on two CNN backbones: EfficientNetV2-S and NFNet-L0. The key components of the
baseline system included:
• Augmentations to address domain shift:
– MixUp [35]: Two audio waveforms were added in the audio domain, and the resulting one-hot target vector was computed as the element-wise maximum across the original vectors. This augmentation was critical to better mimic the multilabel nature of the test data.
– Background mixing: Background audio from prior-year soundscapes and the ESC-50 dataset [36] was randomly overlapped with training samples.
– SpecAugment [37]: Standard time and frequency masking were applied to mel spectrograms.
– RandomFiltering: A simplified version of a random equalizer was used to simulate channel distortions.
• Data sampling strategy followed Equation 1 with p = −0.5.
• Use of secondary labels (if present) with equal weight to the primary label. This transformed the task into a multilabel classification problem, optimized via a linear combination of Binary Cross Entropy (BCE) and Focal loss [<xref ref-type="bibr" rid="ref17">38</xref>].
• Backbone and classification head: An NFNet-L0 or EfficientNetV2-S CNN backbone was combined with a classification head inspired by [<xref ref-type="bibr" rid="ref14">14</xref>], omitting RNN blocks and using weakly labeled prediction during both training and inference.
• Additional datasets were included as described in Section 4.3.</p>
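        <p>The MixUp variant above can be sketched as follows; the Beta-distributed mixing weight is a common convention and an assumption on our part, while the element-wise maximum over targets follows the description in the text:</p>

```python
import numpy as np

def mixup_max(wave_a, target_a, wave_b, target_b, alpha=0.5):
    """Mix two waveforms in the audio domain; combine one-hot targets
    by element-wise maximum to keep the multilabel structure."""
    lam = np.random.beta(alpha, alpha)  # mixing weight (assumed convention)
    wave = lam * np.asarray(wave_a) + (1.0 - lam) * np.asarray(wave_b)
    target = np.maximum(np.asarray(target_a), np.asarray(target_b))
    return wave, target
```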
        <p>w_c = (N_c / ∑_c′ N_c′)^p (1)</p>
        <p>• w_c is the computed weight for class c,
• N_c is the number of samples for class c,
• p is the scaling exponent (typically p &lt; 0 to emphasize rare classes; p = −1 corresponds to balanced sampling).</p>
        <sec id="sec-5-2-1">
          <title>5https://github.com/VSydorskyy/BirdCLEF_2023_1st_place</title>
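          <p>Equation 1 translates into a one-line NumPy computation; sampling then draws each example with probability proportional to the weight of its class (the function name below is ours):</p>

```python
import numpy as np

def class_sampling_weights(counts, p=-0.5):
    """Equation 1: w_c = (N_c / sum_c' N_c')**p.
    p < 0 upweights rare classes; p = -1 yields fully balanced sampling."""
    counts = np.asarray(counts, dtype=float)
    return (counts / counts.sum()) ** p
```

          <p>With p = −1, the total sampling mass N_c · w_c is identical for every class, which is exactly the balanced-sampling regime described above.</p>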
          <p>Models were trained for each validation fold, and final predictions on the Public and Private test sets
were obtained by ensembling the best model from each fold via arithmetic averaging.</p>
          <p>
            The baseline was enhanced by several modifications:
• Label Smoothing [<xref ref-type="bibr" rid="ref18">39</xref>], as defined in Equation 2, with α = 0.05.
• Modified data sampling, using two different strategies:
1. For NFNet: Initial data sampling was used, but all classes with fewer than 100 samples were duplicated until they reached this threshold.
2. For EfficientNetV2: Fully balanced sampling with p = −1.
          </p>
          <p>˜ =  · (1 −  ) +  ·
∑︀ 

(2)
• ˜ is the smoothed label for class ,
•  is the original one-hot encoded label (1 for the correct class, 0 otherwise),
•  is the label smoothing factor,
•  is the total number of classes.</p>
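          <p>Equation 2 in code form, a direct transcription with α and K as defined above:</p>

```python
def smooth_labels(y, alpha=0.05):
    """Equation 2: y_tilde_c = y_c * (1 - alpha) + alpha / K,
    where K is the total number of classes."""
    K = len(y)
    return [v * (1.0 - alpha) + alpha / K for v in y]
```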
          <p>To summarize, augmentations were introduced to mitigate domain shift and increase robustness to
varying recording conditions, while also helping reduce overfitting to training data distribution. Data
sampling strategies addressed the severe class imbalance and the evaluation metric’s equal weighting
across classes. Label smoothing was adopted to counter the noisy nature of weak labels. Finally, the
chosen CNN backbones demonstrated strong performance in birdcall classification tasks and produced
sufficiently diverse outputs, contributing positively to ensemble performance. The detailed training and
inference setup is provided in Appendix B.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Transfer Learning</title>
        <p>As mentioned in Section 2, CNN backbones pretrained on ImageNet already provide a strong
initialization for birdcall classification. However, pretraining on in-domain data is expected to yield better
performance. The pretraining pipeline was organized as follows:
1. Species Selection: A taxonomy of all species from previous BirdCLEF competitions was collected,
resulting in a set of 16,607 unique species.
2. Data Collection: Audio recordings and associated metadata were downloaded from Xeno-Canto
and from previous BirdCLEF competitions.
3. Data Pruning: To avoid data leakage, all species included in BirdCLEF+ 2025 were removed.</p>
        <p>Corrupted files and recordings with unmatched or invalid ebird codes were excluded. Additionally,
species with fewer than 10 recordings were discarded to minimize label noise. The final pretraining
dataset comprised 819,032 recordings spanning 7,489 species.
4. Training Preparation: The dataset was split into training and holdout subsets, with 5% of the
data reserved for holdout evaluation. Validation was performed only on species with at least 100
recordings, to reduce evaluation noise. In total, the evaluation covered 1,627 bird species.
5. Pretraining: Models were trained using the baseline configuration (see Section 5.2), but without
any class balancing. The objective was to learn general audio and birdcall structure, making
sampling adjustments unnecessary at this stage.
6. Checkpoint Preparation: The best checkpoint based on holdout macro ROC AUC was selected.</p>
        <p>Only the CNN backbone was used; the classification head was reinitialized.
7. Finetuning: Fine-tuning was conducted using the same setup as described in Section 5.2, without
any modifications to learning rate or scheduling.</p>
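        <p>Step 6 amounts to filtering the pretrained state dict so that only backbone weights are transferred while the classification head is reinitialized; a minimal sketch with plain dictionaries (the "head." key prefix is an assumption about the checkpoint layout):</p>

```python
def backbone_only(state_dict, head_prefix="head."):
    """Drop classification-head weights from a pretrained checkpoint so the
    head is reinitialized for the new label space."""
    return {k: v for k, v in state_dict.items() if not k.startswith(head_prefix)}
```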
        <p>Using a pretrained CNN backbone resulted in faster convergence during early training and consistently
improved performance on both local cross-validation and the Public and Private test sets.</p>
        <p>Additional experiments were conducted using an extended pretraining dataset that included samples
from CSA, later Xeno-Canto dumps, and public Kaggle datasets. However, these variants did not yield
further improvements in target metrics.</p>
        <p>Moreover, several metric learning approaches were explored:
• Taking two disjoint 5-second segments from the same or different soundscape recordings and
predicting whether they originated from the same audio file.
• Applying the previous strategy to training data recordings.
• Sampling two 5-second segments from the same or different species and predicting whether the
species labels matched.</p>
        <p>Unfortunately, none of these approaches proved effective. Due to the weak labels in the training data
and noise in the pseudo labels for soundscapes, a substantial portion of many recordings contained
background noise or mislabeled segments, which introduced too much noise for the models.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Pseudo Labelling</title>
        <p>To better address domain shift and increase the volume of training data, a semi-supervised learning
strategy was adopted. The overall pipeline is illustrated in Figure 3.</p>
        <p>Initially, eight models were trained using the enhanced baseline configuration, comprising NFNet-L0
and EfficientNetV2-S backbones initialized from our pretrained checkpoints. These models differed in
spectrogram types and optimization parameters. Their predictions on the unlabeled soundscape dataset
were averaged via mean ensembling. Subsequently, a dedicated Pseudo Selection Logic was applied.
For each 5-second chunk, the maximum predicted class probability was computed. Chunks with a
maximum probability below 0.5 were discarded. For retained chunks, all class probabilities below 0.1
were zeroed out. This resulted in a pseudo-labeled dataset consisting of 5-second audio segments, each
associated with a filtered soft target vector. The use of soft targets served two purposes: it discouraged
overconfident predictions and enabled a form of knowledge distillation from the ensemble. Zeroing out
low-confidence probabilities helped suppress noise from uncertain predictions.</p>
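        <p>The Pseudo Selection Logic can be sketched in a few lines of NumPy, with the 0.5 and 0.1 thresholds as described above (function name ours):</p>

```python
import numpy as np

def select_pseudo_labels(chunk_probs, keep_thresh=0.5, zero_thresh=0.1):
    """Keep 5-second chunks whose maximum class probability reaches keep_thresh;
    zero out remaining class probabilities below zero_thresh to suppress noise."""
    probs = np.asarray(chunk_probs, dtype=float)
    keep = probs.max(axis=1) >= keep_thresh
    targets = probs[keep].copy()
    targets[targets < zero_thresh] = 0.0
    return keep, targets
```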
        <p>Since the sampling strategy for the original training data was already carefully calibrated, we did not
simply concatenate the pseudo-labeled dataset to the fold’s training data. Instead, a dedicated Sampling
Strategy was introduced, illustrated in Figure 4.</p>
        <p>
          Additionally, the use of MixUp augmentation introduced an interesting fusion mechanism between hard labels from the original training data and soft labels from the pseudo-labeled samples. Audio segments were mixed in the audio domain, effectively creating an interpolated feature space. Corresponding label vectors were combined via element-wise summation and clipped to the [0, 1] range. As a result, the final targets could simultaneously encode both soft and hard class probabilities.
        </p>
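        <p>This label fusion reduces to an element-wise sum clipped to the [0, 1] range; a minimal sketch (function name ours):</p>

```python
import numpy as np

def fuse_targets(hard_labels, soft_labels):
    """Combine hard training labels with soft pseudo labels after MixUp:
    element-wise sum, clipped to the [0, 1] range."""
    return np.clip(np.asarray(hard_labels, float) + np.asarray(soft_labels, float), 0.0, 1.0)
```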
        <p>In subsequent pseudo-labeling iterations, models were trained on the composition of previously
selected pseudo samples. We employed both NFNet-L0 and EfficientNetV2-S architectures, again using
pretrained initializations and enhanced baseline configurations. Each training experiment consisted
of five folds, resulting in ten models (five per architecture) used to generate predictions for the next
pseudo-labeling round. The number of selected samples per iteration is shown in Table 4. The described
iterative pseudo-labeling algorithm is illustrated in Figure 3.</p>
        <p>One limitation of this iterative pseudo-labeling scheme is that samples from early iterations—predicted
by weaker models—remain fixed in subsequent iterations. To mitigate this, we explored an alternative
approach in which soundscapes were split by fold, and each fold was predicted using models not trained
on it (see Figure 5). This out-of-fold (OOF) strategy enabled refreshing pseudo-labels at every iteration
and increased model diversity due to varying predictions across folds. However, this OOF approach
comes with trade-offs: each 5-second chunk is predicted by only two models instead of ten, reducing
ensemble stability, and each model is trained on only 4/5 of the pseudo-labeled dataset, rather than the
full set.</p>
        <p>Finally, pseudo-labeled samples from both strategies were utilized independently and also merged into
a single dataset using the same "removal from previous iteration" logic (see Figure 3). Training models
on diferent subsets of pseudo data contributed to ensemble diversity and improved generalization on
the final leaderboard submissions.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Data Curation</title>
        <p>We trained our models on 5-second segments randomly selected from the input audio. Based on past
experience, we initially tried three approaches for picking 5-second segments: random 5 seconds from
the whole audio, random 5 seconds from the first 7 seconds, and random 5 seconds from the first or
last 7 seconds. The reasoning for the last two approaches is that recordings are often started when the
target animal is already vocalizing and stopped when it ceases. This can help the model reduce false
positives by focusing on parts of the audio more likely to contain vocalizations. The 7-second window
was chosen to introduce variability while still capturing likely vocal activity.</p>
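        <p>The best-performing crop strategy (a random 5-second segment from the first or last 7 seconds) can be sketched as follows; the 50/50 choice between the two edges is an assumption:</p>

```python
import random

def pick_segment_start(duration, seg=5.0, edge=7.0):
    """Return a random 5-second crop start taken from the first or last
    7 seconds of the recording (whole clip when it is shorter than 5 s)."""
    if duration <= seg:
        return 0.0
    if random.random() < 0.5:          # crop from the beginning
        lo, hi = 0.0, min(edge, duration) - seg
    else:                              # crop from the end
        lo, hi = max(duration - edge, 0.0), duration - seg
    return random.uniform(lo, max(hi, lo))
```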
        <p>In initial tests, the last approach provided better results. However, some species had only a few
audios - as few as 2 in some cases - and we were concerned about overfitting. Therefore, we decided to
inspect the audios for those species to manually identify sections of vocalization. Three such cases are
presented below:
1. Vocalization with alien speech (Figure 7). Several audios contained a computer-synthesized
voice with an ID for the recording. Some audios also included a section in which the recordist
describes the audio, e.g., species recorded, location, temperature, microphone used, etc. In many
of these cases, the description contains no traces of vocalization or even the same background
noise as the sections in which there is vocalization. We refer to these two instances of human
voice as alien speech. For some audios, alien speech represented more than 90% of the recording.
2. Speech overlapping animal vocalization (Figure 8). In this case, the voice can be understood
as background noise.
3. Vocalization with periods of silence (Figure 9), i.e., when the animal is not vocalizing.</p>
        <p>All these cases would result in training with false positives, with the potential to hinder learning.
Hence, we eliminated those sections from our training. However, that ended up hurting our models.
The reasons were not entirely clear. It is possible that our audio curation was overly aggressive and
excluded some true positives. Alternatively, the presence of some false positives may have acted as
a form of regularization, improving generalization by reducing overfitting to the limited number of
available recordings.</p>
        <p>We also experimented with extensive manual curation for species with fewer than 30 samples,
listening to the recordings while simultaneously inspecting spectrograms and energy level charts. As
this approach was not scalable for larger classes, we additionally applied automatic speech detection and
made several attempts to adjust labels based on predictions from our strongest models. The intuition
was that if a model consistently assigned low probabilities to a label, that label was likely incorrect for
that segment; similarly, consistently high probabilities for a species not in the labels might indicate an
overlooked species. We tried using these predictions as soft labels, hard labels, or to cancel the original
annotations when probabilities were very low.</p>
        <p>Unfortunately, none of these curation strategies led to consistent performance improvements. In the
end, we decided to either use the whole audio or exclude only the sections of alien speech identified
manually or automatically.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Postprocessing</title>
        <p>We attempted several forms of post-processing, which were based on two hypotheses:
• If an animal is vocalizing in an audio recording, it will appear in multiple segments.
• The vocalizations of some species follow rhythmic patterns. For example, an animal may vocalize
once every second for 16 seconds, then pause for 6 seconds, and repeat this cycle.</p>
        <p>From the first hypothesis, we derived the following methods:
• Mean: multiply the probability of a species in each segment by the average probability of the
species in the whole audio. The intuition is that if a species has a high probability in other
segments, a high probability in a given segment is more trustworthy when compared to other
audios. The converse is also true.</p>
        <p>postproc_prob(a, s, c) = prob(a, s, c) · ( (1 / S_a) · Σ_{s'=1..S_a} prob(a, s', c) )   (3)
where:
– a – audio index,
– s – segment index (5-second chunk),
– c – species class index,
– S_a – total number of segments in audio a.
• TopN: multiply the probability of a species in each segment by the average of the top N
probabilities of the species in the whole audio. The intuition for this approach was similar to Mean,
with the additional intention of ignoring periods of silence in which the species is not vocalizing.</p>
        <p>postproc_prob(a, s, c) = prob(a, s, c) · ( (1 / N) · Σ_{s' ∈ T_{a,c}} prob(a, s', c) )   (4)
where:
– T_{a,c} – the set of indices of the top N values of prob(a, :, c),
– N – the number of top segments used for smoothing the probability.
• Convolution: adjust the probability of each segment by applying a convolution over the
probabilities of neighboring segments. The intuition is that if a species has a high probability in
neighboring segments, it is more likely to be vocalizing in a given segment.</p>
        <p>From the second hypothesis, we derived the following method:
• L2 model: A layer 2 model (L2) was trained to refine the predictions for each test audio. The
intuition behind this approach is that if a model can learn the rhythmic patterns of a species’
vocalization, it can correct errors from the layer 1 model (L1) and produce predictions that
better reflect the species’ natural behavior. The L2 model was trained on the pseudo-labeled
(see Section 5.4) soundscape recordings using species-level vocalization probabilities for each
5-second segment. To simulate prediction errors from the L1 model, random noise was added to
the original probabilities to create the inputs. The training labels were generated by thresholding
the original probabilities at 0.5, converting them into hard labels. Training was restricted to
species with at least 10 audio files. Several model architectures were evaluated, and a simple
convolutional model with different kernels per species yielded the best results.</p>
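        <p>As an illustration of the L2 idea, per-species smoothing with a distinct 1D kernel along the segment
axis can be sketched as follows (a hand-set sketch, not the trained L2 model, whose kernels are learned;
names are ours):
```python
import numpy as np

def refine_with_species_kernels(probs, kernels):
    """Apply a separate 1D convolution along the segment (time) axis
    for each species, mimicking a per-species rhythmic smoother.

    probs:   (num_segments, num_classes) L1-model probabilities.
    kernels: one 1D kernel per class (learned in the real L2 model,
             hand-set here for illustration).
    """
    refined = np.empty_like(probs)
    for c, kernel in enumerate(kernels):
        # mode="same" keeps one output per segment (zero-padded edges).
        refined[:, c] = np.convolve(probs[:, c], kernel, mode="same")
    return np.clip(refined, 0.0, 1.0)

probs = np.array([[0.2, 0.9],
                  [0.8, 0.1],
                  [0.2, 0.9],
                  [0.8, 0.1]])
# Class 0: 3-segment moving average; class 1: identity (left unchanged).
kernels = [np.full(3, 1 / 3), np.array([1.0])]
refined = refine_with_species_kernels(probs, kernels)
```
</p>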
        <p>Our tests showed that the first two methods improved the original predictions. Unfortunately, the L2
model did not yield improvements over the TopN method on the Public Leaderboard, even when the
two were combined. Combining the other methods with each other also did not result in further gains.
Our best results were consistently achieved using the TopN method with N = 1.</p>
      </sec>
      <sec id="sec-5-7">
        <title>5.7. Ensembling</title>
        <p>To obtain a robust final solution, an ensembling strategy was employed. As previously discussed, five
models (one from each training fold) were used per experiment to generate predictions for the Public
test set. In the final ensemble, we performed a simple average over the predictions of all five folds across
three selected experiments, resulting in an ensemble of 15 models. Postprocessing was then applied to
the averaged predictions.</p>
        <p>
          To select the optimal experiments for inclusion in the ensemble, we explored two strategies:
• Selecting high-performing models based on their Public Leaderboard scores and maximizing the
ensemble’s Public score directly.
• Selecting three experiments using Optuna [
          <xref ref-type="bibr" rid="ref19">40</xref>
          ], with the objective of maximizing the OOF
validation macro ROC AUC. Additionally, we tracked the individual contribution of each experiment
by measuring the performance gain over the best single model.
        </p>
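        <p>For a small pool of experiments, the Optuna objective can equivalently be solved by exhaustive
search, sketched below (assumptions: a self-contained rank-based AUC with no tied scores, and our own
function names; in practice Optuna replaces the brute-force loop):
```python
import numpy as np
from itertools import combinations

def binary_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation (no ties)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def best_triple(oof_preds, y_true):
    """Pick the 3 experiments whose averaged OOF predictions maximize
    macro ROC AUC.

    oof_preds: dict name to (num_samples, num_classes) OOF probabilities.
    y_true:    (num_samples, num_classes) binary labels.
    """
    best_score, best_combo = -1.0, None
    for combo in combinations(oof_preds, 3):
        blend = np.mean([oof_preds[k] for k in combo], axis=0)
        macro = np.mean([binary_auc(y_true[:, c], blend[:, c])
                         for c in range(y_true.shape[1])])
        if macro > best_score:
            best_score, best_combo = macro, combo
    return best_score, best_combo
```
</p>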
        <p>The Optuna-based selection yielded the highest Private score, suggesting that even when validation
metrics are only weakly correlated with test performance, they may still provide useful guidance for
constructing effective ensembles.</p>
        <p>Ranking-based ensembling was also explored; however, it did not lead to competitive performance.
To enable compatibility with postprocessing, the pipeline applied postprocessing first, followed by
rank transformation, and then arithmetic averaging. We hypothesize that our models are relatively
well-calibrated, so their raw probability outputs contribute meaningfully to the final score. Moreover,
applying postprocessing directly to individual models often resulted in degraded performance compared
to applying it at the ensemble level.</p>
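        <p>The rank-based pipeline we explored (postprocess each model first, then rank-transform per class,
then average) can be sketched as follows (function names are ours):
```python
import numpy as np

def rank_transform(preds):
    """Replace each class's segment scores by their normalized ranks in [0, 1]."""
    order = np.argsort(preds, axis=0)
    ranks = np.empty_like(preds, dtype=float)
    np.put_along_axis(ranks, order,
                      np.arange(len(preds), dtype=float)[:, None], axis=0)
    return ranks / max(len(preds) - 1, 1)

def rank_ensemble(per_model_preds, postprocess=lambda p: p):
    """Postprocess each model's predictions first, rank-transform,
    then take the arithmetic mean across models."""
    return np.mean([rank_transform(postprocess(p)) for p in per_model_preds],
                   axis=0)

m1 = np.array([[0.1], [0.5], [0.3]])
m2 = np.array([[0.2], [0.9], [0.4]])
blended = rank_ensemble([m1, m2])
```
Note that the rank transform discards calibration information, which is one plausible reason this
variant underperformed probability averaging for our models.</p>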
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Main Results and Ablation Study</title>
        <p>The progression of model improvements is summarized in Table 5. The baseline models already
demonstrated strong performance on the training data, although a substantial gap was observed
between cross-validation (CV) and Public/Private test scores. NFNet-L0 showed a clear advantage over
EfficientNetV2-S at this stage. Introducing the proposed enhancements (see Section 5.2) resulted in
comparable CV performance, while yielding noticeable improvements in Private scores and modest
gains on the Public set. This behavior aligns with expectations, as the enhancements were designed to
handle noisy targets and solve the class imbalance problem.</p>
        <p>The introduction of transfer learning further boosted all metrics, reinforcing the effectiveness of
pretraining on large in-domain birdcall datasets. Interestingly, this stage reversed the earlier trend of
NFNet-L0 superiority, with EfficientNetV2-S now outperforming it. This shift may be attributed to
differences in optimization policies between the architectures and requires further investigation in
future work.</p>
        <p>The most substantial gains were achieved through pseudo-labeling, which significantly reduced the
domain shift. Validation scores remained stable or slightly improved, suggesting that the pseudo-labeled
data did not introduce excessive noise. In contrast, Public and Private scores improved markedly.
Both full and OOF pseudo-labeling strategies benefited from a second iteration, while performance
plateaued or slightly declined in the third, particularly for EfficientNetV2-S. This may indicate that the
most informative (i.e., less ambiguous) segments had already been captured. No consistent preference
emerged between full and OOF pseudo-labeling strategies.</p>
        <p>Finally, postprocessing (TopN method with N = 1) contributed an additional 1–1.5% improvement
in ROC AUC on the Public and Private sets, further validating our hypothesis from Section 5.6 regarding
the repeated vocalizations of birds across the same file. We do not report postprocessing scores on local
validation, as 5-second chunk-level labels are not available for the training data.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Ensembling Results</title>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Impact of Additional Training Data</title>
        <p>Table 7 presents the results of evaluating different additional data configurations. Compared to
BirdCLEF 2023 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], in this year’s competition the inclusion of additional data did not yield substantial
performance improvements. In fact, incorporating large volumes of extra data slightly degraded results
on both the Public and Private leaderboards. This behavior can be explained by the already sufficient
size of the 2025 dataset, where enriching it with more samples brings limited benefit. Additionally, some
of the external data sources contained systematic speech artifacts (see Section 5.5), which may have
introduced false positives and confused the models. Still, we hypothesize that the additional data may
improve performance for certain undersampled species. A deeper analysis of this aspect is left for future research.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Augmentation Ablation</title>
        <p>Table 8 shows the results of the augmentation ablation study. While MixUp does not lead to
improvements on the Public test set, it significantly boosts performance on the Private set and
improves local validation scores. This suggests that MixUp
contributes to training more robust models. We hypothesize that the absence of performance gain on
the Public leaderboard may be due to a lower number of overlapping bird vocalizations per 5-second
chunk compared to the Private set. SpecAugment and RandomFiltering slightly decrease Public scores
but provide noticeable gains on the Private set and minor improvements in local validation. Overall,
these results support the conclusion that applying heavier augmentations leads to more robust models,
consistent with general theoretical expectations.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations and Conclusions</title>
      <sec id="sec-7-1">
        <title>7.1. Limitations</title>
        <p>Our work has several limitations that should be acknowledged:
• Lack of Robust Evaluation: The local validation strategy does not reflect the real-world
deployment scenario of the system. Furthermore, the available soundscapes are biased towards a
single geographic location. As our approach relies heavily on pseudo labels derived from these
data, the model’s performance may degrade when applied to soundscapes from other regions.
• Limited Hyperparameter Optimization: The focus of this work was primarily on exploring
methodological improvements rather than optimizing configurations. As a result, we did not
perform an extensive hyperparameter search, especially for pseudo-labeling thresholds and model
settings, which may have limited the achievable performance.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Conclusions</title>
        <p>In this work, we proposed a system that achieved state-of-the-art performance on the BirdCLEF+ 2025
task, placing 2nd overall with a Public Leaderboard score of 0.925 and a Private Leaderboard score of
0.928. Our approach is built on five key pillars: a strong baseline model, transfer learning from
large-scale birdcall datasets, semi-supervised learning with elements of model distillation, postprocessing,
and ensembling. We further complemented our work with in-depth data analysis and ablation studies
to better understand the contributions of each component.</p>
        <p>
          Despite these achievements, there remains substantial room for future improvement. Potential
directions include:
• Developing a more robust evaluation framework by retraining the same pipelines on different
subsets of the data and evaluating them on soundscapes from diverse geographic locations.
Alternatively, training a unified model on all available soundscapes and classes may lead to more
generalizable performance.
• Employing model-driven filtering to identify and exclude non-vocalized audio fragments in the
training set, thereby reducing label noise and improving optimization efficiency.
• Leveraging the full capabilities of strong label prediction models such as [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], particularly during
inference, to better capture temporal bird vocalization patterns.
        </p>
        <p>• Addressing the challenge of undersampled species.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We would like to thank the organizers of BirdCLEF, the Kaggle team, and LifeCLEF for hosting this
competition. We are especially grateful to the Armed Forces of Ukraine—without their resilience and
protection, this work would not have been possible.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI’s ChatGPT in order to improve clarity,
coherence, and LaTeX formatting. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
      <p>[17] R. Kulkarni, B. Chandarana, Audio based species identification for monosyllabic call birds using convolutional neural networks, International Journal for Research in Applied Science &amp; Engineering Technology (IJRASET) 7 (2019).
[18] B. Desplanques, J. Thienpondt, K. Demuynck, Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification, arXiv preprint arXiv:2005.07143 (2020).
[19] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International conference on machine learning, PMLR, 2019, pp. 6105–6114.
[20] A. Brock, S. De, S. L. Smith, K. Simonyan, High-performance large-scale image recognition without normalization, in: International conference on machine learning, PMLR, 2021, pp. 1059–1071.
[21] M. Tan, Q. Le, Efficientnetv2: Smaller models and faster training, in: International conference on machine learning, PMLR, 2021, pp. 10096–10106.
[22] J. Li, Z. Yu, Z. Du, L. Zhu, H. T. Shen, A comprehensive survey on source-free domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[23] G. Wilson, D. J. Cook, A survey of unsupervised deep domain adaptation, ACM Transactions on Intelligent Systems and Technology (TIST) 11 (2020) 1–46.
[24] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: International conference on machine learning, PMLR, 2015, pp. 1180–1189.
[25] A. Miyaguchi, A. Cheung, M. Gustineli, A. Kim, Transfer learning with pseudo multi-label birdcall classification for ds@gt birdclef 2024, arXiv preprint arXiv:2407.06291 (2024).
[26] D.-H. Lee, et al., Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on challenges in representation learning, ICML, volume 3, Atlanta, 2013, p. 896.
[27] P. Kage, J. C. Rothenberger, P. Andreadis, D. I. Diochnos, A review of pseudo-labeling for computer vision, arXiv preprint arXiv:2408.07221 (2024).
[28] J. Jiang, Y. Shu, J. Wang, M. Long, Transferability in deep learning: A survey, arXiv preprint arXiv:2201.05867 (2022).
[29] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, Panns: Large-scale pretrained audio neural networks for audio pattern recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 2880–2894.
[30] D. B. Efremova, M. Sankupellay, D. A. Konovalov, Data-efficient classification of birdcall through convolutional neural networks transfer learning, in: 2019 Digital image computing: Techniques and applications (DICTA), IEEE, 2019, pp. 1–8.
[31] B. Ghani, T. Denton, S. Kahl, H. Klinck, Global birdsong embeddings enable superior transfer learning for bioacoustic classification, Scientific Reports 13 (2023) 22876.
[32] California Academy of Sciences and the National Geographic Society, iNaturalist, https://www.inaturalist.org/, 2025. Accessed: June 13, 2025.
[33] Visor de sonidos – colecciones humboldt, https://colecciones.humboldt.org.co/sonidos/visor-csa/, 2025. Accessed: June 13, 2025.
[34] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, S. Xie, Convnext v2: Co-designing and scaling convnets with masked autoencoders, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 16133–16142.
[35] K. Xu, D. Feng, H. Mi, B. Zhu, D. Wang, L. Zhang, H. Cai, S. Liu, Mixup-based acoustic scene classification using multi-channel convolutional neural network, in: Advances in Multimedia Information Processing–PCM 2018: 19th Pacific-Rim Conference on Multimedia, Hefei, China, September 21-22, 2018, Proceedings, Part III 19, Springer, 2018, pp. 14–23.
[36] K. J. Piczak, ESC: Dataset for Environmental Sound Classification, in: Proceedings of the 23rd Annual ACM Conference on Multimedia, ACM Press, 2015, pp. 1015–1018. URL: http://dl.acm.org/citation.cfm?doid=2733373.2806390. doi:10.1145/2733373.2806390.
[37] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, Specaugment: A simple data augmentation method for automatic speech recognition, arXiv preprint arXiv:1904.08779 (2019).</p>
    </sec>
    <sec id="sec-10">
      <title>A. Data Tables and Figures</title>
    </sec>
    <sec id="sec-11">
      <title>B. Detailed Training and Inference Setup</title>
      <p>
        All our models were trained using PyTorch [
        <xref ref-type="bibr" rid="ref20">41</xref>
        ] and Lightning [
        <xref ref-type="bibr" rid="ref21">42</xref>
        ] frameworks. We used the timm
library [
        <xref ref-type="bibr" rid="ref22">43</xref>
        ] for CNN backbones.
      </p>
      <p>
        All models, including pretraining, were trained for 50 epochs with a batch size of 64. For
EfficientNetV2-S models, we used the AdamW [
        <xref ref-type="bibr" rid="ref23">44</xref>
        ] optimizer with a learning rate of 1 × 10⁻⁴,
ε = 1 × 10⁻⁸, and β = (0.9, 0.999). For NFNet-L0 models, RAdam [
        <xref ref-type="bibr" rid="ref24">45</xref>
        ] was used with a learning rate
of 1 × 10⁻³. The learning rate followed a cosine schedule down to 1 × 10⁻⁶ without warm-up.
      </p>
      <p>
        We used nnAudio [
        <xref ref-type="bibr" rid="ref25">46</xref>
        ] for on-the-fly spectrogram extraction during the forward pass. The parameters
used are summarized in Table 11. After extraction, the spectrogram was converted to decibels using
AmplitudeToDb with top_db = 80 and amin = 1 × 10⁻¹⁰, then standardized and scaled to the [0, 1]
range.
      </p>
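      <p>The decibel conversion and scaling can be approximated in NumPy as follows (a sketch, not the
nnAudio/torchaudio implementation; we use plain min-max scaling to [0, 1] as one reading of the
standardization step):
```python
import numpy as np

def amplitude_to_scaled_db(spec, top_db=80.0, amin=1e-10):
    """Convert a magnitude spectrogram to decibels, clamp the dynamic
    range to top_db below the peak, then min-max scale to [0, 1]."""
    db = 20.0 * np.log10(np.maximum(spec, amin))  # amplitude to dB
    db = np.maximum(db, db.max() - top_db)        # keep top_db of range
    return (db - db.min()) / max(db.max() - db.min(), 1e-12)

spec = np.abs(np.random.default_rng(0).normal(size=(128, 64)))
scaled = amplitude_to_scaled_db(spec)
```
</p>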
      <p>Regarding augmentations, MixUp was applied with a probability of 50%. Background noise
augmentation was also used with a 50% probability, equally drawing noise samples from the soundscape dataset
and ESC-50.</p>
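      <p>For reference, the classic MixUp formulation on raw waveforms can be sketched as below (an
illustrative sketch; the exact label-mixing variant used in our pipeline may differ, and the function
name is ours):
```python
import numpy as np

def mixup(wave_a, wave_b, labels_a, labels_b, alpha=1.0, rng=None):
    """Classic MixUp: blend two samples and their multi-label targets
    with a Beta-distributed weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed_wave = lam * wave_a + (1.0 - lam) * wave_b
    mixed_labels = lam * labels_a + (1.0 - lam) * labels_b
    return mixed_wave, mixed_labels

wave, labels = mixup(np.ones(8), np.zeros(8),
                     np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                     rng=np.random.default_rng(0))
```
</p>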
      <p>
        The classification head used 512 hidden channels, with a dropout of 0.25 after the CNN encoder
and 0.5 after the hidden layer. A ReLU activation was used in the hidden layer. To aggregate CNN
embeddings across the frequency dimension, GeM [
        <xref ref-type="bibr" rid="ref26">47</xref>
        ] pooling was employed.
      </p>
      <p>Nearly all experiments were trained on random 5-second chunks from training samples (see
Section 5.5). Inference on soundscapes was performed using non-overlapping 5-second windows. For
validation, predictions on training files were aggregated by taking the maximum probability across
segments per class.</p>
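      <p>The max-over-segments aggregation used for validation can be sketched as (function name ours):
```python
import numpy as np

def file_level_scores(segment_probs):
    """Aggregate a file's 5-second segment predictions into one score
    per class by taking the maximum over segments."""
    return segment_probs.max(axis=0)

probs = np.array([[0.1, 0.7],
                  [0.9, 0.2]])
scores = file_level_scores(probs)
```
</p>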
      <p>
        To optimize data loading during training, audio files were converted to h5py [
        <xref ref-type="bibr" rid="ref27">48</xref>
        ] format, enabling
byte-wise access to random 5-second chunks. For inference, model precision was reduced to FP16 and
exported to the OpenVINO format [
        <xref ref-type="bibr" rid="ref28">49</xref>
        ], which also facilitates out-of-the-box deployment. During inference,
spectrogram extraction was done once and reused across all ensemble models.
      </p>
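      <p>The h5py layout can be sketched as follows (a minimal sketch; the sample rate of 32 kHz and the
function names are our assumptions):
```python
import numpy as np
import h5py

SR = 32_000          # assumed sample rate of the decoded audio
CHUNK_SECONDS = 5

def write_waveforms(path, waveforms):
    """Store decoded waveforms as one dataset per recording so that any
    5-second chunk can later be read without decoding the whole file."""
    with h5py.File(path, "w") as f:
        for name, wave in waveforms.items():
            f.create_dataset(name, data=wave.astype(np.float32))

def read_random_chunk(path, name, rng=None):
    """Read a random 5-second chunk; h5py slicing reads only the
    requested byte range from disk."""
    rng = rng or np.random.default_rng()
    chunk_len = SR * CHUNK_SECONDS
    with h5py.File(path, "r") as f:
        ds = f[name]
        start = int(rng.integers(0, max(len(ds) - chunk_len, 1)))
        return ds[start:start + chunk_len]
```
</p>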
        <p>All training jobs were executed on two different setups:
• Kaggle Notebook: Used a P100 GPU. Training all 5 folds typically required 15 to 18 hours.
• Dev Box: Equipped with an NVIDIA GeForce RTX 4090 (24 GB). Training all 5 folds took between
5.2 and 5.7 hours, the speedup mainly due to the use of h5py, higher I/O throughput, and a faster CPU.</p>
    </sec>
    <sec id="sec-12">
      <title>C. Data Curation Examples</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alotaibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nassif</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence in environmental monitoring: in-depth analysis</article-title>
          ,
          <source>Discover Artificial Intelligence</source>
          <volume>4</volume>
          (
          <year>2024</year>
          )
          <fpage>84</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stowell</surname>
          </string-name>
          ,
          <article-title>Computational bioacoustics with deep learning: a review and roadmap</article-title>
          ,
          <source>PeerJ</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>e13152</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          , S. Kahl, T. Denton, Cornell birdcall identification, https://kaggle.com/competitions/birdsong-recognition,
          <year>2020</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          , S. Kahl, T. Denton, Birdclef 2021 - birdcall identification, https://kaggle.com/competitions/birdclef-2021,
          <year>2021</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Navine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton, Birdclef 2022, https://kaggle.com/competitions/birdclef-2022,
          <year>2022</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton, Birdclef 2023, https://kaggle.com/competitions/birdclef-2023,
          <year>2023</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          , Maggie,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          Birdclef 2024, https://kaggle.com/competitions/birdclef-2024,
          <year>2024</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Demkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton, Birdclef+ 2025, https://kaggle.com/competitions/birdclef-2025,
          <year>2025</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          , et al.,
          <source>Overview of lifeclef</source>
          <year>2025</year>
          :
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Toro-Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rodriguez-Buritica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Benavides-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Ulloa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Caycedo-Rosales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF+
          <year>2025</year>
          <article-title>: Multi-taxonomic sound identification in the middle magdalena valley, colombia</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Xeno-Canto</surname>
            <given-names>Foundation</given-names>
          </string-name>
          ,
          <article-title>Xeno-canto: Sharing bird sounds from around the world</article-title>
          , https:// xeno-canto.org/,
          <year>2025</year>
          . Accessed: June 13,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khamparia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <article-title>Sound classification using convolutional neural network and tensor deep stacking network</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>7717</fpage>
          -
          <lpage>7727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lasseck</surname>
          </string-name>
          ,
          <article-title>Acoustic bird detection with deep convolutional neural networks</article-title>
          .,
          <source>in: DCASE</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Adavanne</surname>
          </string-name>
          , T. Virtanen,
          <article-title>Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network</article-title>
          ,
          <source>arXiv preprint arXiv:1710.02998</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eibl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Birdnet: A deep learning solution for avian diversity monitoring</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>61</volume>
          (
          <year>2021</year>
          )
          <fpage>101236</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abdoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cardinal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Koerich</surname>
          </string-name>
          ,
          <article-title>End-to-end environmental sound classification using a 1d convolutional neural network</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>136</volume>
          (
          <year>2019</year>
          )
          <fpage>252</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>R.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>When does label smoothing help?</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>T.</given-names>
            <surname>Akiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yanase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koyama</surname>
          </string-name>
          ,
          <article-title>Optuna: A next-generation hyperparameter optimization framework</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <article-title>Pytorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <source>arXiv preprint arXiv:1912.01703</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>W.</given-names>
            <surname>Falcon</surname>
          </string-name>
          ,
          and
          <collab>The PyTorch Lightning team</collab>
          ,
          <source>PyTorch Lightning</source>
          ,
          <year>2019</year>
          . URL: https://github.com/Lightning-AI/lightning. doi:10.5281/zenodo.3828935.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <source>Pytorch image models</source>
          , https://github.com/rwightman/pytorch-image-models,
          <year>2019</year>
          . doi:10.5281/zenodo.4414861.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>arXiv preprint arXiv:1711.05101</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>On the variance of the adaptive learning rate and beyond</article-title>
          ,
          <source>arXiv preprint arXiv:1908.03265</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Cheuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Agres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Herremans</surname>
          </string-name>
          ,
          <article-title>nnAudio: An on-the-fly GPU audio to spectrogram conversion toolbox using 1d convolutional neural networks</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>161981</fpage>
          -
          <lpage>162003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>F.</given-names>
            <surname>Radenović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tolias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Chum</surname>
          </string-name>
          ,
          <article-title>Fine-tuning cnn image retrieval with no human annotation</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>41</volume>
          (
          <year>2018</year>
          )
          <fpage>1655</fpage>
          -
          <lpage>1668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>A.</given-names>
            <surname>Collette</surname>
          </string-name>
          ,
          <source>Python and HDF5</source>
          ,
          <publisher-name>O'Reilly</publisher-name>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [49]
          <collab>OpenVINO Toolkit Contributors</collab>
          , OpenVINO Toolkit, https://github.com/openvinotoolkit/openvino,
          <year>2024</year>
          . Accessed: 2025-06-13.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>