<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>One Detector per Bird: A Scalable Binary Classification Approach for BirdCLEF+ 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shreejith Suthraye Gokulnath</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chandrima Das</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arya Gaikwad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keerthana Senthilnathan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shruti Prasad Sawant</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>San Diego (UCSD), La Jolla, CA 92093</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automated bird sound classification has become an essential tool for ecological monitoring and biodiversity research. The BirdCLEF 2025 challenge presents a large-scale, multi-species audio classification task with 206 target species recorded under highly variable acoustic conditions. To address the challenges of extreme class imbalance, overlapping species, and domain variability, we propose a modular framework that trains one binary classifier per species. This approach allows for targeted feature engineering, interpretability, and scalable parallel training. We develop the system in three iterative stages: starting with a single-species prototype, expanding to a multi-species configuration, and finally scaling to the full species set with metadata integration and targeted data augmentation. Audio recordings are encoded using binary frequency activation patterns and log-mel statistics, with classification performed using Random Forests and XGBoost. Although local cross-validation results showed strong performance, domain shifts in the test data highlighted the complexity of ecologically real-world soundscapes. Our findings underscore the value of species-specific modeling strategies and offer a flexible framework for future bioacoustic monitoring systems. We participated as team echo in the BirdCLEF 2025 challenge and achieved a public leaderboard score of 0.568 and a private leaderboard score of 0.561. These results reflect the generalization difficulty posed by diverse ecological soundscapes and validate the scalability of our binary-classifier approach.</p>
      </abstract>
      <kwd-group>
        <kwd>BirdCLEF 2025</kwd>
        <kwd>Bird sound classification</kwd>
        <kwd>Bioacoustics</kwd>
        <kwd>Species detection</kwd>
        <kwd>Binary classifiers</kwd>
        <kwd>Class imbalance</kwd>
        <kwd>XGBoost</kwd>
        <kwd>Ecological monitoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Bird sound classification plays a critical role in biodiversity monitoring and ecological research. As the
global biodiversity crisis intensifies, monitoring wildlife populations has become more critical than ever.
Birds, as highly responsive indicators of ecosystem health, are central to many conservation efforts.
With the growing availability of passive acoustic monitoring systems, large volumes of field recordings
are being collected across diverse habitats, offering a unique window into bird communities at scale.
However, extracting meaningful insights from these recordings requires automated systems capable of
identifying species accurately under real-world conditions, including overlapping calls, background
noise, and vast species diversity. Recent work has highlighted the growing potential of passive acoustic
monitoring tools enhanced with deep learning, few-shot detection, and metadata-aware models.</p>
      <p>The BirdCLEF 2025 challenge addresses this need by tasking participants with detecting bird species in
long-form audio recordings. Unlike controlled laboratory datasets, these recordings are heterogeneous
in quality, length, and complexity. Moreover, the dataset presents a classic long-tail distribution: a
few species are well-represented, while many appear infrequently, making traditional multi-label
classification approaches less effective.</p>
      <p>To tackle these challenges, we adopt a modular, species-specific modeling strategy. Instead of training
a single multi-label classifier, we train a dedicated binary model for each of the 206 target species. This
design provides several advantages: it handles extreme class imbalance more effectively, allows for
species-specific feature tuning and error analysis, and enables parallel model training.</p>
      <p>Our development process unfolded in three stages. We began with a single-species prototype to
validate our core pipeline, then expanded to a three-species setup to test generalization, and finally
scaled to the full species set using metadata-aware preprocessing, adaptive thresholding, and targeted
augmentation for underrepresented classes. Our models rely on interpretable binary frequency
activation features derived from short-time Fourier transforms and log-mel spectrograms, combined with
ensemble learning methods such as Random Forests and XGBoost.</p>
      <p>Despite promising validation results, our final leaderboard performance highlighted persistent
challenges in domain generalization, a common issue in ecological AI. Nonetheless, our results demonstrate
that modular, per-species classification is a viable and scalable approach for large-scale bioacoustic
monitoring. This work contributes a reproducible framework for future research and underscores the
value of flexible, interpretable systems in complex environmental tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The BirdCLEF 2025 challenge dataset encompasses multiple components to facilitate automated bird
sound recognition:
1. Train audio: The BirdCLEF 2025 training dataset comprises short audio recordings of individual
vocalizations from a wide range of species, including birds, amphibians, mammals, and insects.
These recordings were contributed by three primary sources:
• xeno-canto.org
• iNaturalist
• The Colombian Sound Archive (CSA), curated by the Humboldt Institute for Biological Resources
Research in Colombia</p>
      <p>All recordings are resampled to a uniform 32 kHz sampling rate to align with the test set audio.
The files are provided in OGG format, which offers a balance between compression
and quality. No additional data was downloaded from xeno-canto or iNaturalist. Each audio file
follows the naming convention [collection][file_id_in_collection].ogg, where the prefix
identifies the source collection.
2. Test Soundscapes: The test dataset contains approximately 700 soundscape recordings, each
exactly one minute long. They are also provided in OGG format. These audio files have
been resampled to 32 kHz, ensuring consistency with the training data.</p>
      <p>The test_soundscapes directory is automatically populated with the test audio files when the
notebook is submitted to the competition platform. No content or ordering can be inferred from the
filenames alone, as they follow a randomized pattern of the form soundscape_xxxxxx.ogg.
3. Train Soundscapes: The training dataset includes a collection of unlabeled soundscapes:
long-form audio recordings captured at the same general locations as the test soundscapes.
These files are intended to help understand environmental background noise, acoustic context,
and species co-occurrence patterns in realistic settings. Each file is named using the format
[site]_[date]_[local_time].ogg, which encodes the recording site identifier, date, and local time
of capture. The recordings do not overlap with the hidden test soundscapes despite being
recorded in the same geographic region, which helps avoid overfitting and allows the
model to generalize across unseen environments.
4. Training metadata: The dataset contains a csv file that provides metadata for each training
audio recording, such as the quality rating of an audio file and its geographic coordinates. A few
columns include:
• Primary Label: It consists of a standardized species code which represents the primary
vocalization in the recording. For birds, this corresponds to the eBird species code (e.g., gretin1
for Great Tinamou); for non-bird taxa, the iNaturalist taxon ID is used. These codes can often
be appended to URLs for further species information such as https://ebird.org/species/gretin1
for the Great Tinamou.
• Secondary Label: It contains a list of other species, annotated by recordists, that
are also present in the background of the recording. The field may be incomplete in some
cases and is treated with caution.
• Latitude and Longitude: These columns contain the geographic coordinates indicating
where the recording was captured. These are used to understand
regional vocal dialects which may be present in some bird species.
• Author: This column contains the name of the user who contributed the recording. It may
be "unknown" when the contributor chose to remain anonymous or
was not recorded.
• Filename: It contains the name of the associated audio file, which also encodes its source
collection.
• Rating: Rating varies from 1 (low) to 5 (high) and is provided by the users of xeno-canto. A
value of 0 implies that no rating is available; iNaturalist and CSA do not
provide quality ratings.
• Collection: It indicates the origin of the recording: XC (xeno-canto), iNat (iNaturalist), or</p>
      <p>CSA (Colombian Sound Archive). This field also aligns with the prefix of the filename.
5. Sample Submission: This file contains the valid sample submission format which contains a
csv of the row_id and species_id. The probability of the presence of each species needs to be
predicted for each row.
6. Taxonomy: This file consists of data on different species along with taxonomic hierarchy and
standardized codes.
7. Recording Location: This file contains location metadata associated with each recording,
enabling spatially informed analysis.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Species Distribution and Class Imbalance</title>
        <p>Comprehensive analysis reveals 206 unique species represented in the dataset, exhibiting typical
ecological abundance patterns. The species distribution demonstrates moderate imbalance, with 23
species (11.2%) having fewer than 5 recordings, 39 species (18.9%) having fewer than 10 recordings,
and 78 species (37.9%) having fewer than 50 recordings. This long-tail distribution presents significant
challenges for machine learning approaches, particularly for rare and endangered species that are
primary targets of conservation efforts.</p>
        <p>The top 20 most represented species show clear dominance patterns, with several species exceeding
800 recordings while many others fall below 100 recordings. This class imbalance necessitates
specialized training strategies, including weighted loss functions, balanced sampling techniques, and data
augmentation approaches tailored to minority classes.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Audio Characteristics and Quality Assessment</title>
        <p>Statistical analysis of audio characteristics reveals substantial temporal heterogeneity within the dataset.
Duration statistics show a right-skewed distribution with the first quartile at 13.0 seconds, median at
22.6 seconds, third quartile at 45.2 seconds, and maximum extending to 249.8 seconds. This variation
reflects the natural diversity of vocalization behaviors across taxonomic groups, from brief insect chirps
to extended mammalian calls.</p>
        <p>All analyzed recordings maintain consistent technical specifications with 32 kHz sampling rates and
OGG compression format. Root mean square energy levels average 0.033 with standard deviation of
0.026, indicating moderate amplitude variability across recordings. Maximum amplitude values range
from 0.005 to 1.10, suggesting varying recording conditions and source-to-microphone distances typical
of field recordings.</p>
        <p>The substantial presence of recordings exceeding 30 seconds (82 files in the analyzed sample) indicates
the need for segmentation strategies during model training to maintain computational efficiency while
preserving important temporal patterns in longer vocalizations.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Spectral Feature Analysis</title>
        <p>Spectral analysis reveals distinct acoustic signatures across the dataset that reflect the multi-taxonomic
nature of the collection. Spectral centroid analysis shows a mean frequency focus at 3,979 Hz with
standard deviation of 1,568 Hz, indicating substantial spectral diversity across species. The distribution
ranges from low-frequency vocalizations at 1,192 Hz to high-frequency signals reaching 11,221 Hz,
encompassing the full range of vertebrate and invertebrate acoustic communication.</p>
        <p>Spectral rolloff patterns demonstrate that 85% of signal energy concentrates below 7,079 Hz on average,
with considerable variation (standard deviation: 2,206 Hz). This frequency distribution supports the
chosen 32 kHz sampling rate, ensuring adequate capture of high-frequency components while avoiding
unnecessary computational overhead.</p>
        <p>Zero-crossing rate analysis reveals mean values of 0.191 with standard deviation of 0.113, indicating
moderate signal complexity. The range from 0.025 to 0.628 reflects the diversity from tonal bird songs
(low ZCR) to broadband insect sounds (high ZCR), providing discriminative features for taxonomic
classification.</p>
        <p>Energy distribution analysis across frequency bands shows mean low-frequency energy at -24.9 dB,
mid-frequency at -29.2 dB, and high-frequency at -42.1 dB. This pattern indicates stronger representation
in lower frequency ranges, consistent with the dominance of larger vertebrate species in the collection,
while maintaining sufficient high-frequency content for insect and small vertebrate classification.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Multi-taxonomic Acoustic Characteristics</title>
        <p>The dataset’s multi-taxonomic composition introduces unique analytical challenges absent from previous
bird-focused competitions. Acoustic features span multiple orders of magnitude in temporal and spectral
domains, requiring robust normalization and feature engineering approaches. Spectral bandwidth
measurements average 3,073 Hz with standard deviation of 715 Hz, reflecting the diverse acoustic niches
occupied by different taxonomic groups.</p>
        <p>Mel-frequency cepstral coefficient (MFCC) analysis across the first 13 coefficients reveals
species-specific patterns suitable for machine learning classification. The coefficient distributions show sufficient
inter-class variation to support automated taxonomic identification while maintaining intra-class
consistency necessary for reliable model training.</p>
        <p>Dominant frequency analysis identifies primary spectral peaks averaging 2,066 Hz, with secondary
and tertiary peaks providing harmonic structure information crucial for species discrimination. The
frequency peak patterns exhibit taxonomic clustering, with insects typically showing higher primary
frequencies than amphibians and mammals.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>The BirdCLEF 2025 challenge focuses on detecting bird species from long-form audio recordings that
vary widely in quality, length, and number of overlapping species. Rather than adopting a single
multi-label model across all 206 species, we pursued a modular approach by training a separate binary
classifier per species. This decision was driven by several practical and scientific motivations:
• Extreme class imbalance: Many species have few labeled clips, while others dominate the
dataset.
• Debuggability and interpretability: Per-species models allowed easier error analysis and
feature tuning.
• Parallelizability: Independent models could be trained concurrently with minimal
cross-dependencies.</p>
      <p>We developed this system in three iterative stages, progressively increasing the complexity and
generality of our pipeline:
1. A single-bird prototype focused on a single species (grekis) to validate core ideas and gain early
feedback.
2. A three-species extension to generalize the architecture, improve robustness, and explore
cross-species differences.
3. A full-scale system spanning all 206 species, incorporating metadata, dynamic threshold tuning,
and robust pre-processing.</p>
      <p>In the following sections, we describe each stage in detail, highlighting design decisions, modeling
strategies, and lessons learned as we scaled up the system.</p>
      <sec id="sec-4-1">
        <title>4.1. Stage 1: Audio Preprocessing and Feature Extraction</title>
        <p>Based on comprehensive dataset analysis, we implemented a standardized preprocessing pipeline
optimized for the multi-taxonomic characteristics of BirdCLEF+ 2025. All audio files undergo resampling to
32 kHz to maintain consistency with the existing dataset standard, followed by amplitude normalization
to the [-1, 1] range to account for varying recording conditions across field sites.</p>
        <p>Duration standardization employs a segmentation strategy for recordings exceeding 30 seconds,
creating overlapping 5-second windows with 50% overlap to preserve temporal context while
maintaining computational tractability. Recordings shorter than 5 seconds receive zero-padding to ensure
consistent input dimensions for neural network architectures.</p>
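        <p>As an illustration, a minimal sketch of this windowing logic in Python with NumPy (the helper name segment_audio and the exact padding scheme are ours, not from the competition code):</p>
        <preformat>
import numpy as np

SR = 32000          # competition sampling rate (Hz)
WIN = 5 * SR        # 5-second window
HOP = WIN // 2      # 50% overlap

def segment_audio(y):
    """Split a waveform into overlapping 5 s windows, zero-padding short clips."""
    if WIN > len(y):                        # pad recordings shorter than 5 s
        y = np.pad(y, (0, WIN - len(y)))
    starts = range(0, len(y) - WIN + 1, HOP)
    return [y[s:s + WIN] for s in starts]
        </preformat>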
        <p>Quality enhancement utilizes adaptive spectral subtraction for noise reduction, particularly beneficial
for field recordings containing environmental interference. The approach estimates noise characteristics
from low-energy segments and applies frequency-domain filtering to enhance signal-to-noise ratios
while preserving essential acoustic features across all taxonomic groups.</p>
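        <p>The precise noise-reduction algorithm is not specified; the following is a minimal spectral-subtraction sketch under one plausible set of assumptions (noise profile estimated from the lowest-energy 10% of STFT frames):</p>
        <preformat>
import numpy as np
import librosa

def spectral_subtract(y, noise_pctl=10):
    """Subtract a noise profile estimated from the lowest-energy STFT frames."""
    S = librosa.stft(y)
    mag, phase = np.abs(S), np.angle(S)
    frame_energy = mag.mean(axis=0)
    quiet = np.percentile(frame_energy, noise_pctl) >= frame_energy
    noise = mag[:, quiet].mean(axis=1, keepdims=True)   # per-bin noise estimate
    mag_clean = np.maximum(mag - noise, 0.0)            # frequency-domain filtering
    return librosa.istft(mag_clean * np.exp(1j * phase))
        </preformat>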
        <p>Mel-spectrogram generation employs optimized parameters derived from spectral analysis results:
128 mel bins spanning the 50-16,000 Hz frequency range, a 2048-sample FFT window with a 512-sample
hop length, providing 16 ms temporal resolution optimal for capturing rapid acoustic transients in insect
vocalizations while maintaining sufficient frequency resolution for mammalian and amphibian calls.</p>
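        <p>A sketch of this transform using librosa with the parameters stated above (the wrapper name logmel is ours):</p>
        <preformat>
import numpy as np
import librosa

def logmel(y, sr=32000):
    """128-band log-mel spectrogram: 2048-sample FFT, 512-sample hop, 50-16,000 Hz."""
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=512,
        n_mels=128, fmin=50, fmax=16000,
    )
    return librosa.power_to_db(S, ref=np.max)   # logarithmic amplitude scaling
        </preformat>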
        <p>The mel-scale transformation provides perceptually relevant frequency representation particularly
suited to the diverse acoustic characteristics observed in the dataset. Logarithmic amplitude scaling
enhances dynamic range representation, improving model sensitivity to quiet vocalizations against
background noise commonly present in field recordings.</p>
        <p>Feature augmentation strategies address class imbalance through targeted data synthesis.
Time-shifting augmentation (±0.5 seconds) accounts for temporal alignment variations, while pitch-shifting
(±2 semitones) increases sample diversity for underrepresented species. Mixup augmentation between
taxonomically similar species enhances model robustness while preserving biological acoustic
relationships.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Stage 2: Single-Bird Detector for grekis</title>
        <p>The development of our pipeline began with a single-bird detection experiment targeting the species
grekis, which had the highest number of labeled recordings in the dataset (~990 files). Focusing on
this class enabled rapid prototyping, while still being representative of a common class in a highly
imbalanced species distribution.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Problem Setup and Negative Sampling</title>
          <p>We formulated a binary classification task to distinguish between recordings of grekis and recordings
of other common species. Specifically, the positive class comprised all grekis recordings, while the
negative class was drawn from a uniformly sampled subset of the next most frequent species in the
training set (e.g., compau, trokin, roahaw, etc.). This ensured a roughly balanced dataset for training
and evaluation, avoiding dominance by any single non-target species.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Binary Frequency-Binned Feature Representation</title>
          <p>To encode audio signals in a compact, interpretable form, we devised a fixed-length binary vector
representation that captures frequency-wise activity patterns. Each audio file was split into 5-second
chunks, and for each chunk, we computed the Short-Time Fourier Transform (STFT) to obtain the
time-frequency representation. The energy spectrum was then converted to the decibel scale via logarithmic
compression.</p>
          <p>The frequency axis was discretized into 3200 bins spanning the 0–16 kHz range in 5 Hz intervals. For
each bin, we computed the maximum energy across all frames and applied an adaptive thresholding
scheme based on a chosen percentile (e.g., 85th percentile). If the peak energy in a bin exceeded the
threshold, that bin was marked as active (1); otherwise, it was marked inactive (0). This yielded a
3200-dimensional binary vector per chunk.</p>
          <p>Finally, chunk-level vectors for a file were aggregated using the median across all chunks, and
binarized once more to yield a file-level binary vector. This process compresses high-dimensional
spectrotemporal dynamics into a sparse and interpretable representation, while preserving discriminative
frequency activity.</p>
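          <p>The sketch below illustrates one reading of this encoding. The FFT size of 6400 samples (which gives 5 Hz bin spacing at 32 kHz) and the use of a global percentile threshold are assumptions, since the text specifies only the bin width and the percentile:</p>
          <preformat>
import numpy as np
import librosa

N_BINS, PCTL = 3200, 85      # 5 Hz bins over 0-16 kHz; example threshold percentile

def chunk_vector(chunk):
    """Binary frequency-activity vector for one 5 s chunk at 32 kHz."""
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(chunk, n_fft=6400)))
    peaks = S_db[:N_BINS].max(axis=1)       # max energy per 5 Hz bin across frames
    thresh = np.percentile(S_db, PCTL)      # adaptive percentile threshold
    return (peaks > thresh).astype(np.uint8)

def file_vector(chunks):
    """Median-aggregate chunk vectors for a file, then re-binarize."""
    med = np.median([chunk_vector(c) for c in chunks], axis=0)
    return (med >= 0.5).astype(np.uint8)
          </preformat>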
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Model Training and Evaluation</title>
          <p>We trained a binary classifier using a Random Forest model from scikit-learn, configured with 500
trees and class_weight='balanced' to account for any residual imbalance. A 5-fold stratified
cross-validation was employed to evaluate generalization performance and mitigate variance across
folds.</p>
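          <p>A sketch of this training setup with scikit-learn, using random placeholder data in place of the real feature matrix and labels:</p>
          <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 3200))    # placeholder file-level binary vectors
y = rng.integers(0, 2, size=400)            # placeholder grekis / non-grekis labels

clf = RandomForestClassifier(n_estimators=500, class_weight="balanced", n_jobs=-1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv, scoring="f1").mean())
          </preformat>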
          <p>To enhance feature discrimination, we empirically tuned the adaptive threshold percentile and
observed that values in the range 74–78% yielded the best performance, with the optimal value being
76.7%. At this setting, we achieved an average F1-score of 0.742 across validation folds, indicating strong
separability between grekis and other frequent species using our binary frequency encoding.</p>
          <p>This single-bird detector stage served as a conceptual and technical prototype for subsequent
multi-bird detectors, validating our feature extraction and classification pipeline.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Stage 3: Multi-Bird Extension (3-Bird Detectors)</title>
        <p>Building upon the single-bird prototype, we next extended our binary classification framework to the
three most frequently occurring species in the dataset: grekis, compau, and trokin. The objective
was to assess how well per-species one-vs-rest detectors could generalize across multiple species, while
still maintaining species-specific specialization.</p>
        <p>For each of the three species, we trained a separate binary classifier. Each model’s positive samples
consisted of all recordings labeled with that species, while the negative samples were randomly drawn
from the remaining top-10 most frequent species, excluding the current positive class. To mitigate label
imbalance, we applied stratified random undersampling to equalize the number of positive and negative
examples per classifier.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Feature Extraction and Thresholding</title>
          <p>We reused the binary frequency-bin encoding from the single-bird stage. Each audio recording was divided into
5-second segments, and a 3200-dimensional binary feature vector was constructed per segment. This
vector captures frequency band activation over 5 Hz bins spanning the 0–16 kHz range.</p>
          <p>For each frequency bin, energy values across frames were compared against an adaptive threshold
derived from the global energy distribution of the recording. The bin was considered active if the
maximum energy exceeded the threshold. Following empirical tuning during earlier experiments, we
fixed the threshold percentile at 76.7% across all species, a setting that consistently yielded high F1-scores
without overfitting. Although adaptive per-species tuning could further optimize performance, fixing
this value provided both efficiency and generalization.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Modeling with XGBoost</title>
          <p>We adopted gradient-boosted decision trees, implemented via the XGBoost library, as our classifier of
choice. XGBoost builds an ensemble of shallow decision trees in sequence, where each successive tree
aims to minimize the error made by the previous ensemble through gradient-based optimization. This
approach allows for flexible modeling of non-linear relationships and is well-suited to tabular data with
mixed feature types.</p>
          <p>Compared
to the Random Forest model used earlier, XGBoost provided better control over loss optimization,
regularization, and imbalance handling. We used the logloss evaluation metric during training, which
aligns with the probabilistic nature of the task and encourages well-calibrated outputs rather than hard
labels.</p>
          <p>To account for class imbalance within training splits, we set the scale_pos_weight parameter to
the ratio of negative to positive examples. This weighting biases the gradient updates in favor of the
minority class, improving recall without manual resampling.</p>
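          <p>A sketch of one such detector with this weighting, assuming a recent xgboost version and placeholder data in place of the binary frequency vectors described above:</p>
          <preformat>
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 3200)).astype(np.float32)  # placeholder features
y = rng.integers(0, 2, size=600)                             # placeholder labels

n_neg, n_pos = int(np.sum(y == 0)), int(np.sum(y == 1))
clf = XGBClassifier(
    eval_metric="logloss",
    scale_pos_weight=n_neg / n_pos,  # bias gradient updates toward the minority class
)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # probabilistic outputs rather than hard labels
          </preformat>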
          <p>Each model was evaluated via 5-fold stratified cross-validation. Average metrics across folds were as
follows:
• grekis: F1 = 0.7592, AUC-ROC = 0.8308
• compau: F1 = 0.8360, AUC-ROC = 0.9178
• trokin: F1 = 0.7841, AUC-ROC = 0.8589</p>
          <p>Inference design and downstream aggregation are discussed in later sections, once the
206-species models are introduced.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Stage 4: Scaling to 206 Species</title>
        <p>Having validated the single-bird detector strategy in earlier stages, we next scaled our framework to all
206 species in the BirdCLEF 2025 dataset. This transition required substantial modifications to both our
data pipeline and modeling strategy to handle class imbalance, diverse metadata, and species-specific
noise characteristics.
Data Preprocessing and Metadata Filtering. We began by filtering the training data based on the
provided rating field, retaining all recordings with a rating ≥ 3.0 while allowing unrated recordings
(rating = 0) to remain. This step ensured low-quality samples were removed while not penalizing
datasets such as CSA and iNaturalist which do not provide ratings.</p>
        <p>We also incorporated geographic metadata (latitude, longitude) and one-hot encoded the
collection type (CSA, XC, iNat), recognizing their potential to capture regional dialects and recording
artifacts.</p>
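        <p>A sketch of this filtering step in pandas, assuming the column names described in Section 2 (the metadata filename train.csv is also an assumption):</p>
        <preformat>
import pandas as pd

meta = pd.read_csv("train.csv")   # training metadata (filename assumed)

# Keep rated recordings with rating >= 3.0 and unrated ones (rating == 0),
# so that CSA and iNaturalist files without ratings are not discarded.
meta = meta[(meta["rating"] >= 3.0) | (meta["rating"] == 0)]

# One-hot encode the source collection (XC, iNat, CSA).
meta = pd.get_dummies(meta, columns=["collection"])
        </preformat>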
        <sec id="sec-4-4-1">
          <title>Feature Extraction via Log-Mel and Binary Masking</title>
          <p>Each audio file was converted to a 128-band
log-mel spectrogram. From this representation, we computed three sets of features:
• The mean and standard deviation across time for each mel band;
• A binary frequency mask indicating activity in each band, computed by thresholding the mel
power values at the 90th percentile;
• Contextual features: latitude, longitude, and one-hot collection type.</p>
          <p>The final feature vector concatenated all of the above, yielding a consistent representation across
files of varying duration and origin.</p>
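          <p>A sketch of the resulting feature vector; computing the mask as per-band maxima against a global 90th-percentile threshold is our reading of the description:</p>
          <preformat>
import numpy as np

def file_features(mel_db, lat, lon, collection_onehot):
    """Concatenate per-band stats, a binary activity mask, and contextual features."""
    band_mean = mel_db.mean(axis=1)                    # 128 per-band means over time
    band_std = mel_db.std(axis=1)                      # 128 per-band std deviations
    mask = (mel_db.max(axis=1) > np.percentile(mel_db, 90)).astype(np.float32)
    return np.concatenate([band_mean, band_std, mask,
                           [lat, lon], collection_onehot])
          </preformat>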
          <p>Label Construction and Sample Weighting. To enable per-species binary classification, we
trained an independent model for each species. For a given species s, all recordings with s
as the primary_label were labeled positive. Optionally, recordings where s appeared as a
secondary_label were also included as positives, but with a down-weighted contribution
(controlled by secondary_label_weight = 0.5).</p>
          <p>To balance class contributions, we scaled positive and negative sample weights such that their total
weight was equal per species. This ensured that rare species were not underrepresented during model
training.
Model Architecture and Training. We used XGBoost classifiers as our modeling backbone. For
each of the 206 species, we trained a separate binary classifier (see the sketch after this list) using:
• n_estimators = 100
• max_depth = 6
• learning_rate = 0.1
• tree_method = hist (or gpu_hist if CUDA available)
• eval_metric = logloss</p>
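          <p>A sketch of one per-species training call combining the weighting and hyperparameters above (the function name train_species_model and the boolean metadata masks are ours):</p>
          <preformat>
import numpy as np
from xgboost import XGBClassifier

def train_species_model(X, is_primary, is_secondary):
    """Train one binary detector with down-weighted secondary-label positives."""
    y = (is_primary | is_secondary).astype(int)
    w = np.where(np.logical_and(is_secondary, ~is_primary), 0.5, 1.0)
    w[y == 1] *= w[y == 0].sum() / w[y == 1].sum()  # equalize total class weights
    clf = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1,
                        tree_method="hist", eval_metric="logloss")
    clf.fit(X, y, sample_weight=w)
    return clf
          </preformat>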
          <p>For evaluation, we employed 5-fold stratified cross-validation. In cases where a species had fewer
than 5 positive samples, we reduced the number of folds to preserve class representation. Models with
insufficient class diversity were skipped.</p>
          <p>Augmentation and Low-F1 Remediation. Upon evaluating the full set of models, we identified 31
species with F1 scores of 0.0. These often had very few training samples. To address this, we performed
targeted data augmentation, generating up to five synthetic variants per audio file for the affected species
using pitch shift, time stretch, and background noise injection. Models for these species were then
retrained from scratch.</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>Augmentation for Low-Performance Species</title>
          <p>Despite cross-validation and class balancing efforts, 31 species yielded an F1-score of 0.0. These
typically had very limited training data and insufficient variation for robust generalization. To address
this, we implemented a targeted augmentation pipeline.</p>
          <p>We filtered the training metadata to include only recordings from these 31 underperforming species.
For each original audio file, we synthetically generated five variants using the following transformations:
• Time Stretch (TS): 0.9× and 1.1× the original tempo to simulate temporal variance.
• Pitch Shift (PS): upward and downward shifts of 2 semitones to simulate vocal modulation.
• Gain Adjustment: applied random amplitude scaling to mimic variable recording volumes.
Each augmented file was saved with a filename suffix (e.g., CSA00001_ps_up.ogg), and corresponding
metadata entries were duplicated with updated filenames. These were appended to the original training
set, resulting in a 6× increase in data volume for the affected species.</p>
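          <p>A sketch of this augmentation pass using librosa and soundfile; the gain range and output naming are illustrative assumptions:</p>
          <preformat>
import numpy as np
import librosa
import soundfile as sf

def augment_and_save(y, sr, stem, out_dir="train_audio_aug"):
    """Write the five synthetic variants described above for one recording."""
    rng = np.random.default_rng(0)
    variants = {
        "ts_slow": librosa.effects.time_stretch(y, rate=0.9),
        "ts_fast": librosa.effects.time_stretch(y, rate=1.1),
        "ps_up":   librosa.effects.pitch_shift(y, sr=sr, n_steps=2),
        "ps_down": librosa.effects.pitch_shift(y, sr=sr, n_steps=-2),
        "gain":    y * rng.uniform(0.5, 1.5),   # random amplitude scaling (assumed range)
    }
    for suffix, y_aug in variants.items():
        sf.write(f"{out_dir}/{stem}_{suffix}.ogg", y_aug, sr)
          </preformat>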
          <p>Importantly, augmentation was applied only to species that failed to produce any true positives during
validation. This selective approach allowed us to preserve the original distribution for well-performing
classes while enhancing diversity where needed most.</p>
          <p>The retraining of models for these 31 species yielded considerable gains (e.g., plukit1: F1 from 0.00
→ 0.72).
Output and Model Persistence. All trained models were serialized using joblib and stored in
a dictionary mapping species ID to classifier instance. Evaluation metrics, including average
cross-validation AUC-ROC and F1-score, were logged to a leaderboard.json file for downstream analysis.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Inference Pipeline</title>
        <p>At inference time, each test recording was segmented into non-overlapping 5-second audio chunks,
matching the training data setup. For each chunk, we extracted a 3200-dimensional binary vector
representing frequency band activity, using a Short-Time Fourier Transform (STFT) followed by adaptive
thresholding at the 76.7th percentile. These features were fed into all 206 per-species binary classifiers
to obtain class-wise probability scores. The output was then assembled into a multilabel submission
matrix compliant with the BirdCLEF 2025 format.</p>
        <p>Each row in the submission corresponded to a specific chunk, indexed using the original filename
and its time offset (e.g., row_id = soundscape123_10). If a model for a particular species was not
available (due to insufficient data during training), we conservatively assigned a low fixed probability
(0.001) to indicate uncertainty.</p>
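        <p>A sketch of this chunk-level scoring loop, reusing the file_vector helper sketched in Section 4.2.2; the row_id here uses the chunk end time, matching the soundscape123_10 example above:</p>
        <preformat>
import pandas as pd

def predict_soundscape(y, sr, name, models, species_list):
    """Score every non-overlapping 5 s chunk with all per-species classifiers."""
    rows, win = [], 5 * sr
    for i, start in enumerate(range(0, len(y) - win + 1, win)):
        feats = file_vector([y[start:start + win]]).reshape(1, -1)
        row = {"row_id": f"{name}_{(i + 1) * 5}"}
        for sp in species_list:
            m = models.get(sp)
            row[sp] = m.predict_proba(feats)[0, 1] if m else 0.001  # missing model
        rows.append(row)
    return pd.DataFrame(rows)
        </preformat>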
        <p>Although local cross-validation results (using 5-fold stratified CV) showed promising performance,
especially after targeted augmentation of low-performing species, the final performance on the hidden
Kaggle test set was lower than expected. This discrepancy may be attributed to the limited coverage
of acoustic contexts in training data, the absence of publicly accessible test recordings during model
development, and the sensitivity of hard-thresholded binary features to background noise.</p>
        <p>Interestingly, other teams that used pre-trained embedding extractors such as YAMNet
and leveraged GPU-accelerated deep learning pipelines also reported similarly modest leaderboard
scores. These observations suggest that the BirdCLEF 2025 test distribution may present considerable
domain shift and challenge even robust representation models, reinforcing the difficulty of generalizing
well across diverse ecological soundscapes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Post-Modeling Results and Analysis</title>
      <p>We evaluated our per-species classifiers using a suite of diagnostic plots to capture key patterns
and failure modes. These include species rankings by F1 score, AUC–F1 heatmaps, precision–recall
scatterplots, and performance gaps based on training data availability. Together, they offer insight into
model behavior and guide post-hoc correction strategies.</p>
      <sec id="sec-5-1">
        <title>5.1. Top Models by F1 Score</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Heatmap of AUC and F1 Scores</title>
        <p>Figure 11 provides a side-by-side view of AUC and F1 scores across species. While some classifiers
perform well across both metrics, others show high AUC but near-zero F1. These cases reveal models
that rank correctly but fail to cross the prediction threshold, often due to class imbalance or label
sparsity.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Precision–Recall Tradeoffs</title>
        <p>Figure 9 illustrates trade-offs between precision and recall. Some models are overly conservative,
achieving high precision but missing positives. Others are more liberal, flagging many positives but
with more false alarms. These patterns can be species-specific and help inform threshold adjustments.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. AUC–F1 Gap vs Number of Recordings</title>
        <p>In Figure 12, a clear trend emerges where species with fewer recordings have wider AUC–F1 gaps.
These classifiers learn to rank but struggle to make reliable binary predictions. This highlights the need
for post-training calibration or more training data for low-resource species.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Taxonomic and Geographic Patterns</title>
        <p>Species from well-represented groups like flycatchers and toucans performed better, likely due to
consistent call structure and richer data. In contrast, species with variable calls or recordings from
acoustically complex environments performed worse. Recordings from xeno-canto also consistently
outperformed those from CSA and iNaturalist, reflecting source quality differences.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Data Challenges and Recovery Strategies</title>
        <p>Many species had fewer than 10 usable recordings, limiting classifier effectiveness. For such
low-resource cases, we applied targeted augmentation including pitch shifts and time stretching. As
shown in Table 2, this approach led to significant improvements in F1 scores for several previously
underperforming species.</p>
      </sec>
      <sec id="sec-5-7">
        <title>5.7. Reflections and Deployment Considerations</title>
        <p>While AUC remains a useful indicator of class separability, we observed that it often overestimated
real-world deployment readiness, especially for species with low base rates or asymmetric errors.
Several models with high AUC-ROC failed to yield usable predictions under fixed thresholds, indicating
that threshold calibration is crucial for practical applications. Precision–recall tradeoffs may be more
informative in such low-prevalence settings.</p>
        <p>Another key observation was the performance disparity between locally cross-validated scores and
final leaderboard results. This domain shift highlights the importance of incorporating geographically
and acoustically diverse samples during training, as well as the need for robust evaluation protocols
that mimic deployment-time variability.</p>
        <p>Although our pipeline incorporated metadata (e.g., latitude, collection type), there remains substantial
room for integrating richer context-aware signals such as habitat type, time of day, or ecological
co-occurrence patterns to improve model robustness.</p>
        <p>The augmentation strategy proved beneficial for many underperforming species, yet its impact was
inconsistent, suggesting that augmentation must be both species-specific and acoustically realistic.</p>
        <p>Moreover, augmentation can improve F1 at the expense of AUC, raising questions about which metric
should drive optimization in ecological monitoring contexts.</p>
        <p>Finally, while species-specific classifiers offer advantages in interpretability and modularity, they
also introduce scalability constraints in large deployments. Future efforts could explore hybrid models
that balance per-species flexibility with shared representations.</p>
        <p>Overall, our framework demonstrates promise for modular bioacoustic classification but reveals open
challenges in threshold tuning, domain generalization, and low-resource adaptation.</p>
      </sec>
      <sec id="sec-5-8">
        <title>5.8. Baseline Comparison with Deep Learning Approaches</title>
        <p>Although our framework relies on modular XGBoost classifiers for per-species detection, we compared
its effectiveness against common deep learning baselines reported by other teams in the BirdCLEF 2025
competition.</p>
        <p>Deep convolutional networks (CNNs), particularly those using EfficientNet and ResNet architectures
trained on Mel spectrograms, generally outperformed our approach in leaderboard metrics. For example,
EfficientNet-based models trained from scratch (without pre-trained embeddings) achieved public
leaderboard scores as high as 0.613, with private scores around 0.609, outperforming our
XGBoost-based pipeline by a significant margin.</p>
        <p>Transformer models using YAMNet or Audio Spectrogram Transformer (AST) embeddings achieved
performance comparable to ours, with typical scores in the range of 0.55–0.56. However, these models
were more computationally intensive and less interpretable.</p>
        <p>In particular, the top team in BirdCLEF 2025 used a large ensemble of CNNs (including ResNet
and EfficientNet) combined with BirdNET embeddings and advanced test-time augmentation. They
achieved a public score of 0.915 and a private score of 0.902, demonstrating the strength of
pre-trained audio features and large-scale ensembling.</p>
        <p>Despite the stronger performance of deep CNNs, our approach offers several practical advantages:
• Interpretability: Each species has a standalone model, enabling targeted error analysis and
confidence calibration.
• Parallelism: Binary classifiers can be trained independently and efficiently in parallel, enabling
scaling to hundreds of species.
• Low compute cost: Our pipeline runs without GPU acceleration and can be deployed in
constrained environments such as edge devices or field sensors.
• Robust augmentation: By applying targeted data augmentation and frequency activation
masking, we improved rare species performance without complex architectures.</p>
        <p>In summary, while EfficientNet-based CNNs and BirdNET ensembles achieve higher absolute scores,
our modular XGBoost framework provides a lightweight, interpretable alternative that remains
competitive with transformer baselines, particularly in real-world biodiversity monitoring, where interpretability
and modularity are essential.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we presented a modular, scalable framework for large-scale bird sound classification based
on training one binary classifier per species. This design enabled targeted feature engineering,
easier interpretability, and parallel training, addressing key challenges such as class imbalance and
overlapping species. Through a multiphase development process, from a single-species prototype to a
full 206-species system, we demonstrated the effectiveness of binary frequency encodings and log-mel
statistical features in combination with ensemble classifiers like XGBoost.</p>
      <p>Local cross-validation showed high average performance (AUC-ROC: 0.911), validating the
discriminative power of our approach. However, a notable gap between local results and final leaderboard scores
(public: 0.568, private: 0.561) highlighted the difficulty of generalizing to unseen acoustic environments.
To partially address this, we applied targeted data augmentation, which improved F1 scores for many
low-performing species, although sometimes at the cost of reduced AUC.</p>
      <p>Ultimately, while our system offers a flexible and interpretable baseline for ecological sound
classification, it also underscores the challenges of domain shift, limited training data, and feature
representation in bioacoustic applications. Our findings support the continued development of modular,
per-species strategies while also motivating future integration of pretrained embeddings, adaptive
inference, and domain-aware learning.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgment</title>
      <p>We gratefully acknowledge the University of California, San Diego (UCSD) and the Halıcıoğlu Data
Science Institute (HDSI) for providing access to the Data Science/Machine Learning Platform (DSMLP),
which enabled us to efficiently train and evaluate hundreds of species-specific models in parallel. We
are especially thankful to Professor Berk Ustun for his invaluable mentorship, whose insights and
encouragement greatly shaped our modeling approach and analytical thinking. We also extend our
appreciation to our teaching assistant, Ryan Hammonds, for his consistent support, timely feedback,
and technical guidance throughout the project.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT to assist with grammar checking, minor
sentence rephrasing, and improving readability. After using this tool, the authors carefully reviewed
and edited the content to ensure accuracy and originality. The authors take full responsibility for the
content of this publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , et al.,
          <article-title>Channel-spatial-based few-shot bird sound event detection</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>5316</fpage>
          -
          <lpage>5328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arik</surname>
          </string-name>
          ,
          <article-title>Unsupervised sound separation using mixit</article-title>
          ,
          <source>in: Proceedings of ICML</source>
          <year>2021</year>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lehnert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <article-title>AudioProtoPNet: An interpretable deep learning model for bird audio classification</article-title>
          ,
          <source>arXiv preprint arXiv:2401.01234</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Activebird2vec: End-to-end bird monitoring via transformers</article-title>
          ,
          <source>in: NeurIPS</source>
          <year>2023</year>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>ASGIR: Audio spectrogram transformer guided classification and retrieval</article-title>
          ,
          <source>in: Proceedings of ICASSP</source>
          <year>2024</year>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Bird song recognition based on multi-spectral feature fusion using mf-scsenet</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>2034</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>A novel bird sound recognition method based on multifeature fusion and transformer</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>4481</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Michaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lasseck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Unsupervised classification and relabeling of bird audio recordings in xeno-canto</article-title>
          ,
          <source>in: CLEF Working Notes</source>
          <year>2023</year>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Multi-label bird species recognition using attention-bigru networks</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>70</volume>
          (
          <year>2022</year>
          )
          <fpage>101751</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gebhard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kriener</surname>
          </string-name>
          ,
          <article-title>Metadata-augmented zero-shot bird sound classification using audio spectrogram transformers</article-title>
          ,
          <source>in: Proceedings of NeurIPS 2023 Datasets and Benchmarks Track</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hexeberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chitre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hofmann-Kuhnt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <article-title>Semi-supervised classification of bird vocalizations</article-title>
          ,
          <source>arXiv preprint arXiv:2502.13440</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Segura-Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sturley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arevalillo-Herraez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alcaraz-Calero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Felici-Castell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Navarro-Camba</surname>
          </string-name>
          ,
          <article-title>5g ai-iot system for bird species monitoring and song classification</article-title>
          ,
          <source>Sensors</source>
          <volume>24</volume>
          (
          <year>2024</year>
          )
          <fpage>3687</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Robb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <article-title>Long-range bird species identification using directional microphones and CNNs</article-title>
          ,
          <source>Machine Learning and Knowledge Extraction</source>
          <volume>6</volume>
          (
          <year>2024</year>
          )
          <fpage>2336</fpage>
          -
          <lpage>2354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>You</surname>
          </string-name>
          , et al.,
          <article-title>Large-scale avian vocalization detection delivers reliable global biodiversity insights</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>121</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          , et al.,
          <article-title>The use of birdnet embeddings as a fast solution to find novel sound classes in audio recordings</article-title>
          ,
          <source>Ecology and Evolution</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <article-title>Article 140940</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2025: Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Toro-Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rodriguez-Buritica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Benavides-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Ulloa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Caycedo-Rosales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <article-title>Overview of BirdCLEF+ 2025: Multi-taxonomic sound identification in the middle Magdalena valley, Colombia</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>