<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Addressing the Challenges of Domain Shift in Bird Call Classification for BirdCLEF 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emiel Witting</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo de Heer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jefrey Lim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cahit Tolga Kopar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kristóf Sándor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dream Hall</institution>
          ,
          <addr-line>Stevinweg 4, 2628 CN Delft</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Delft Dream Team Epoch</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents Team Epoch IV's solution to the BirdCLEF 2024 competition, which focuses on developing machine learning models for bird call recognition. The primary challenge in this competition is the significant domain shift between the Xeno-Canto recordings used for training and the passive acoustic monitoring (PAM) soundscapes used for testing. This shift poses difficulties due to differences in recording equipment, recording conditions, and background noise, which complicate accurate species identification. We delve into the specifics of this domain shift, quantifying its impact on model performance, and propose methods to mitigate its effects. Our approach includes a comprehensive set of data augmentations and pre- and postprocessing techniques to enhance model robustness and generalization. We performed extensive experiments to verify the effectiveness of these methods. Our findings provide a foundation for future work in addressing domain shift challenges in bioacoustic monitoring, contributing to more accurate and reliable biodiversity assessments.</p>
      </abstract>
      <kwd-group>
        <kwd>Bird Species Classification</kwd>
        <kwd>Domain Shift</kwd>
        <kwd>Domain Adaptation</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Passive Acoustic Monitoring</kwd>
        <kwd>Kaggle Competition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        BirdCLEF 2024 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a Kaggle competition aimed at advancing machine-learning solutions for bird
call recognition, as part of LifeCLEF [
        <xref ref-type="bibr" rid="ref2 ref9">2</xref>
        ]. The primary task involves developing data processing
techniques and models to identify bird species from continuous audio recordings, specifically targeting
under-studied Indian bird species in the Western Ghats. This competition holds value for biodiversity
monitoring, as it leverages PAM to facilitate extensive and temporally detailed surveys, contributing to
conservation efforts.
      </p>
      <p>Participants face several notable challenges, primarily centred around the domain shift between the
training data and test soundscapes. One of the main hurdles is the difference between the Xeno-Canto
recordings used for training and the PAM soundscapes used for testing. This shift is exacerbated by the
fact that Xeno-Canto recordings are not expert-labelled and do not provide labels for each five-second
segment, but rather for the entire file. This lack of precise labelling makes it challenging to handle
secondary labels accurately. The absence of PAM data in the training set poses a significant obstacle:
participants must develop models without having access to the same type of labelled data on which
their models will be evaluated, which necessitates innovative approaches to generalize effectively.
Additionally, the competition imposes a strict inference time limit of two hours on a CPU, requiring
efficient algorithmic implementations.</p>
      <p>
        This paper presents Team Epoch IV’s solution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to the BirdCLEF 2024 competition, with a primary
focus on analyzing and addressing the domain shift challenge. We delve into the specifics of this shift
and examine its impact on the discrepancy between local cross-validation scores and the public and
private leaderboard scores. Our approach includes a detailed exploration of methods to mitigate these
differences and enhance model performance across varied data domains.
      </p>
      <p>The paper is structured as follows: Section 2 describes our implementation strategy, including
environmental setup, data preprocessing, data augmentation, model selection, and postprocessing
techniques. Section 3 discusses the domain shift between training and test data. Section 4 presents our
experiments and results, including an ablation study and seed stability analysis. Section 5 discusses our
findings, and Section 6 concludes with future work and acknowledgements.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Implementation</title>
      <p>In this section, we detail our implementation strategy employed for our participation in the BirdCLEF
2024 competition. Our approach encompasses environmental and training setup, data preprocessing,
data augmentation, model selection, and postprocessing techniques.</p>
      <sec id="sec-2-1">
        <title>2.1. Environmental setup</title>
        <p>
          During the competition, we collaborated as a team. Instead of working in notebooks, which does not
allow for streamlined collaboration, we developed and used our machine learning framework Epochalyst
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This package contains many modules and classes extracted from previous Epoch [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] competition
experience to start new competitions quickly. Epochalyst uses Hydra to load configuration
.yaml files that specify full training or ensemble runs and instantiates elements directly into Python
objects for efficient development. We used Rye [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for project &amp; package management and designed a
custom lazy loading multiprocessing pipeline for loading audio using Dask [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Librosa [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. PyTorch
[9] was utilized as the main framework for training, with additional libraries such as Timm [10] for
using various 2D Convolutional Neural Network architectures. Additionally, for an extra ~2× inference
speed up, ONNX [11] and OpenVINO [12] were used to maximise performance. Models were trained
on on-site hardware running Linux [13], specifically on PCs running AMD Ryzen 9 7950X 16-Core
Processor (96GB RAM) with an NVIDIA RTX A5000 GPU using Python 3.10.13. Model training and run
artefacts were logged on Weights &amp; Biases [14] to keep a clear overview of all of our experiments.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Preprocessing</title>
        <p>The BirdCLEF 2024 training dataset consists of 24459 audio .ogg files uploaded by users of Xeno-Canto
[15], covering 182 different bird species. All training audio was resampled to 32 kHz to
match the test soundscape sampling rate. Pretraining on additional data
from previous BirdCLEF competitions did not improve results, so we used only this year's data for our final submission.
For training, we used a 5-fold CV with a stratified split based on the primary label of the audio file. This
ensures that the species are equally represented in each fold. Taking the first 5 seconds of each audio file
proved optimal, since bird calls tend to appear early
in the uploaded recordings. Some Xeno-Canto files also contained secondary labels for bird species that
appeared in addition to the primary bird. For these, we set the secondary labels to 0.5 and the primary
labels to 1, because the primary birds were consistently more audible in the audio files compared to the
secondary birds.</p>
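<p>The soft-label encoding described above can be sketched as follows. This is a minimal illustration; the helper name and toy species list are ours, not from the competition code.</p>

```python
import numpy as np

def encode_labels(primary, secondary, species):
    """Build a soft multi-label target: the primary species gets 1.0,
    the less audible secondary species get 0.5, all others 0.0."""
    target = np.zeros(len(species), dtype=np.float32)
    target[species.index(primary)] = 1.0
    for s in secondary:
        target[species.index(s)] = 0.5
    return target

# Toy subset of the 182 classes.
species = ["browowl1", "comior1", "comkin1", "woosan"]
y = encode_labels("comior1", ["woosan"], species)
```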
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data Augmentation</title>
        <p>We have implemented several data augmentation techniques to increase the robustness of our models and
to address the domain shift between the training data and the test soundscapes. Our full augmentation
pipeline is listed below. Some of the augmentations are 1D, meaning they are applied to the
raw audio signal. Afterwards, we converted the signal to Mel spectrograms of 256 × 256 pixels, with a
frequency range of 1 Hz to 16 kHz, which are then normalized so that all values lie in the range
of 0 to 1. As a last step in our custom dataset, some 2D augmentations are applied.
• PhaseShift: randomly shifting the phase of each frequency component of the signal with p = 0.5 and
shift_limit = 0.5.¹ ²
• AmplitudeShift: randomly shifting the amplitude of each frequency component of the signal with p = 0.5.
• MixUp [16] with p = 0.5:</p>
        <p>Linearly interpolating both features and labels of two samples, with random weights.
• CutMix [17] with p = 0.5:</p>
        <p>Randomly cropping and replacing part of a sample with another sample. The labels are averaged
linearly with weights proportional to the length/area of each sample.</p>
        <p>• CutMix (2D) with p = 0.5, applied on the Mel spectrogram.</p>
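<p>The two 1D mixing augmentations can be sketched as below. This is a simplified illustration with explicit mixing parameters; the actual pipeline draws them at random and applies each augmentation with probability p = 0.5.</p>

```python
import numpy as np

def mixup_1d(x1, y1, x2, y2, w):
    """Linearly interpolate both the raw audio and the label vectors."""
    return w * x1 + (1 - w) * x2, w * y1 + (1 - w) * y2

def cutmix_1d(x1, y1, x2, y2, start, end):
    """Replace x1[start:end] with the same span from x2; labels are averaged
    with weights proportional to each sample's share of the length."""
    x = x1.copy()
    x[start:end] = x2[start:end]
    frac = (end - start) / len(x1)
    return x, (1 - frac) * y1 + frac * y2

x1, x2 = np.ones(8), np.zeros(8)          # two toy audio clips
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
xm, ym = mixup_1d(x1, y1, x2, y2, w=0.75)
xc, yc = cutmix_1d(x1, y1, x2, y2, start=0, end=2)
```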
        <p>
          Figure 1 visualizes our augmentation pipeline. The phase shift aims to simulate background
noise in bird regions, in an effort to reduce the shift between the clear training examples and the noisy
soundscapes. The amplitude shift amplifies different frequencies in the signal domain to enhance robustness
against bird-volume variance, since some birds in the soundscapes are in close proximity to
the recording location while others are located further away. Afterwards, CutMix1D and
MixUp1D are applied in the signal domain to improve learning when there are multiple birds in the
same audio file, a common occurrence in the soundscapes. Finally, CutMix2D is applied
after converting the output of the previous pipeline to a Mel spectrogram. An ablation study of these augmentations
can be found in Section 4.1.
¹ This does not influence the magnitude spectrum taken over the whole recording, but when windowed magnitude spectra are extracted it has the effect seen in Figure 1.
² A shift_limit in the range [0, 1] corresponds to a phase shift of [0, 2π].
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Models</title>
        <p>We mainly used Timm [10] for straightforward model development where we experimented with
various architectures. Some of the best encoders that we have found were: convnext_tiny and
eca_nfnet_l0. The convnext_tiny model got the highest public leaderboard score of 0.701 while
we observed it being more unstable over multiple submitted training runs. eca_nfnet_l0, on the
other hand, had a slightly lower public score of 0.688 but we found it to be a more stable model during
experimentation. We decided to submit these two models: the more stable one and the less stable
one but with a higher public score. Models were trained for 50 epochs with an initial learning rate
of 1e-4, using Binary Cross-Entropy [18] loss and the AdamW [19] optimizer. A single-cycle
CosineAnnealing learning rate scheduler was employed with a slight warmup of 2 epochs to ensure
initial stability. Furthermore, models used a sigmoid activation function so that outputs ranged
between 0 and 1. Local evaluation was done on every 5 seconds of each file using the AUROC [20]
metric, where we observed a significant shift between our local scores and public scores. We were able
to optimize our local scores to ~0.995 AUROC by adding dropout, training on multiple datasets from
previous BirdCLEF competitions,³ and including additional augmentations. However, any optimization
above ~0.98 locally caused the public score to drop significantly. This indicates an overfitting
pattern on Xeno-Canto data that reduces performance on the soundscapes, prompting us to focus
on minimizing the shift rather than optimizing on training data.</p>
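<p>The learning-rate schedule described above (2 warmup epochs followed by a single cosine cycle over the remaining epochs) can be written out as follows. This is our reading of the setup, not the exact training code.</p>

```python
import math

BASE_LR, WARMUP, TOTAL = 1e-4, 2, 50

def lr_at(epoch):
    """Linear warmup over the first 2 epochs, then one cosine annealing
    cycle from the base learning rate down towards zero at epoch 50."""
    if epoch < WARMUP:
        return BASE_LR * (epoch + 1) / WARMUP
    progress = (epoch - WARMUP) / (TOTAL - WARMUP)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * progress))
```

In PyTorch this corresponds to chaining a short warmup into <code>torch.optim.lr_scheduler.CosineAnnealingLR</code> on an AdamW optimizer.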
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Postprocessing</title>
        <p>The test soundscapes are 4 minutes long, and we must predict for each 5-second window, resulting
in 48 predictions per soundscape. We calculate the mean bird-species probability per soundscape over
the 48 windows and multiply each individual prediction by the mean of the soundscape it is in. The
reasoning is that birds usually appear multiple times per recording, so the
mean should be high for birds that are truly in the audio. This consistently improved our scores by
~0.02 on both the public and private leaderboards.</p>
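<p>This mean-multiplication postprocessing amounts to one line of NumPy (shapes and the function name are illustrative):</p>

```python
import numpy as np

def smooth_soundscape(preds):
    """preds: (n_windows, n_species) sigmoid outputs for one soundscape.
    Multiply each window's prediction by the per-species mean over the
    soundscape, boosting species that recur across many windows."""
    return preds * preds.mean(axis=0, keepdims=True)

# 4 windows shown for brevity instead of the full 48.
preds = np.array([[0.9, 0.1],
                  [0.9, 0.1],
                  [0.1, 0.1],
                  [0.1, 0.1]])
out = smooth_soundscape(preds)
```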
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Domain shift</title>
      <p>The training data for this competition was sourced from a different domain than the test set on which
the models are intended to run. This problem is highly relevant in non-competition settings,
where (labelled) test data is not always available. The impact quickly becomes obvious when observing
that models can achieve above 0.99 AUROC on held-out training data, but score below 0.70 AUROC on
the test set. In this section, we hypothesize about the components of this discrepancy, guided by statistical
analysis, knowledge about the data source, and visual inspection. Furthermore, we attempt to quantify
the domain shift and measure the impact of techniques to mitigate it.</p>
      <sec id="sec-3-1">
        <title>3.1. Mapping the datasets</title>
        <p>We explored the train dataset and looked for differences with the unlabeled soundscapes by making a
visual overview. To ensure that we organize the audio in the way our model perceives it, we compared
the activations of the last hidden layer of a baseline model, instead of the raw input. For both domains,
the first five seconds of one thousand unique recordings were fed through the model. The activations
were then projected onto ℝ² using UMAP [21]. This shows that there is partial overlap between the
domains, and part of the test domain lies completely outside of the training distribution (Figure 2a).</p>
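<p>A minimal version of this embedding step, using a plain PCA projection as a lightweight stand-in for UMAP (the activations here are random placeholders, not model outputs):</p>

```python
import numpy as np

def project_2d(activations):
    """Project last-hidden-layer activations onto their top-2 principal
    components; the paper uses UMAP, which preserves more local structure."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 128))  # 1000 recordings x 128 hidden units
emb = project_2d(acts)
```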
        <p>We then plotted the corresponding spectrograms at the positions of their UMAP embeddings. This
allowed us to understand and identify the different regions of the dataset manually, as shown in Figure
3. Quadrants I and II contain mostly no-calls. It makes sense for these to be outside of the training
distribution, which should contain only labelled bird calls. The difference appears to be that II is fully
quiet, or at least has uniform noise, whereas I has noisy recordings with sounds other than birds.
Most bird calls appear in III and IV. III contains mostly training data, which seems to be characterized
by low-noise, high-contrast bird call recordings. Towards region IV there is a gradient of increasing
amounts of test data. This region is characterized by high-background-noise images with less contrast
(the background looks consistently brighter) and horizontal stripes. We assume the horizontal stripes
are likely insects, such as cicadas. Note that there are in fact some training samples that fit into this
distribution, as can be seen in both Figure 2a and in the bottom right of Figure 3.
(Figure 2: (a) distribution shift; (b) top-20 known labels.)
³ BirdCLEF 2020, 2021, 2022 and 2023 Xeno-Canto and labelled soundscape data retrieved from Zenodo.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Shift causes</title>
        <p>It is important to be cautious about assuming the nature of the problem. High train performance and
low test performance on another domain seem to warrant (unsupervised) domain adaptation, to solve
the apparent problem of domain shift. Well-documented forms include covariate shift, prior shift, or
concept shift [22]. These might be approached with feature-based sample weighting, class-based sample
weighting, and deep domain adaptation methods respectively. However, all of these forms rely on the
assumption that there is only one type of shift at a time and that some other factor remains constant.
Furthermore, it is possible that generalisation is not an issue, and that the drop in score can be explained
solely by the fact that the test domain is just uniformly more difficult and ambiguous.</p>
        <p>With this caution in mind, we hypothesized three main contributors for the drop in score:
1. The models underperform on no-call audio, which the Xeno-Canto training data does not contain.
2. The PAM test data is inherently more difficult and ambiguous to classify, even for models trained
on it.
3. The PAM test data is shifted into feature distributions that our model has not encountered or
generalized to properly during training.</p>
        <p>We exclude prior shift, or label imbalance, as a root cause, because the scoring metric is mostly
class-balance invariant.</p>
        <p>Hypothesis 1 was confirmed by measuring the predictions on test samples from regions I and II
that we confirmed to be no-calls. We observed that our model consistently made predictions
at around 0.5–0.6 confidence. These false positives occur across a handful of species, mostly
browowl1, comior1 and comkin1 for region I, and woosan for region II. We estimate this is partly due
to a random bias and partly to correlated background noise. We have seen false browowl1 positives across
several models, possibly because those samples were recorded for a nocturnal species with mostly quiet
recordings and occasional insects.</p>
        <p>Hypothesis 2 allows the possibility that test-like data is in fact represented in the train data, but at
lower proportions. We might approximate the difficulty of the PAM-like samples by measuring train
scores in region IV. This resulted in a class-mean AUROC of 0.985, which is clearly significantly higher
than the leaderboard test score, even accounting for the small sample size (154 samples with 75 unique
species). This evidence contradicts hypothesis 2. A reason for not rejecting it fully is that those train
samples might not be representative of test data, and that they differ along a dimension that is not
captured by UMAP.</p>
        <p>Hypothesis 3 is the standard problem of models not generalizing to data outside of the training
distribution. The main differences we noticed visually were the decreased contrast (low signal-to-noise
ratio) and horizontal stripes, possibly from insects. Furthermore, we observed more overlapping bird
calls in the test soundscapes than in the train audio. A participant in BirdCLEF 2023 mentioned reverb
[23]. This might also play a role, although we have not had the opportunity to verify this or test reverb
augmentations.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Shift mitigation</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Call or no call classification</title>
          <p>To mitigate the fact that our model underperforms on no-call audio, a two-stage pipeline was introduced.
The first stage consists of a model trained on the Freefield1010 [24] dataset to perform binary
classification for every 5-second window, predicting whether it contains a bird call; this stage achieved an
F1-score of 0.810. If the first stage predicted no call in a 5-second
window of a soundscape, all predictions were set to 0 and the second stage for this window was skipped,
also saving valuable inference time. After empirical visual inspection of the soundscapes, the two-stage
predictions appear to be correct for silent soundscapes, with Figures 9 and 10 illustrating predictions of
our best submission compared to our two-stage approach on a silent soundscape. Interestingly, against
our expectations, our public scores did not improve when submitting our two-stage approach. Further
investigation with the labels of the test soundscapes is recommended to detect where our two-stage
model is making its mistakes.</p>
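<p>The two-stage gating logic reduces to a short conditional; the threshold, function names, and stand-in detectors below are ours, added for illustration.</p>

```python
import numpy as np

def two_stage_predict(window_audio, call_detector, species_model, threshold=0.5):
    """Stage 1: binary call/no-call detector. If no call is predicted,
    emit all-zero species probabilities and skip the (expensive) stage 2."""
    if call_detector(window_audio) < threshold:
        return np.zeros(182)  # one probability per species
    return species_model(window_audio)

silent = np.zeros(160000)  # 5 s of silence at 32 kHz
preds = two_stage_predict(silent,
                          call_detector=lambda a: 0.1,          # stand-in stage 1
                          species_model=lambda a: np.full(182, 0.5))  # stand-in stage 2
```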
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Test Audio Scaling</title>
          <p>In order to remove the shift between distributions described in hypothesis three, we tried two
techniques. The first is to scale down the test audio during inference. Because we min-max scaled our
spectrograms, this is analogous to increasing ε in the logarithmic scaling log(x + ε) that we applied
to spectrograms. We scale the audio down by a factor of 1/100, which we found through empirical
experimentation. The effect is increased contrast, which visually makes the test data look more similar
to the train data. We consistently achieved higher scores on both the public and private leaderboards
as a result.</p>
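<p>A small numerical sketch of why scaling the audio down acts like increasing ε: after log scaling and min-max normalization, faint background energy is pushed towards zero relative to loud calls. The power values here are synthetic, chosen only to show the effect.</p>

```python
import numpy as np

def log_minmax(power, eps=1e-6):
    """Log-scale a power spectrogram and min-max normalize it to [0, 1]."""
    logged = np.log(power + eps)
    return (logged - logged.min()) / (logged.max() - logged.min())

power = np.array([1e-4, 1e-2, 1.0])        # noise floor, faint call, loud call
plain = log_minmax(power)                   # factor 1
scaled = log_minmax(power * (1 / 100) ** 2) # audio x 1/100 -> power x 1/10000
```

After scaling, the faint background energy sits much closer to 0 while the loudest call still maps to 1, i.e. the contrast increases.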
          <p>(Figure: example spectrograms at scaling factors (a) 1, (b) 1/100, and (c) 1/500.)</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Frequency-based noise removal</title>
          <p>The second technique aims to remove ambient noise. It treats the audio as a sum of infrequent (bird)
noises, and background noise that is stronger in some frequencies than others, but is constant over time.
Visually, this means removing all horizontal stripes from spectrograms.</p>
          <p>To obtain a robust estimate of the background noise level per frequency, the quantile q = 0.25 was
used per row of the spectrogram. This implicitly assumes that a bird call does not occupy the same
frequency for more than three-quarters of the sample. If that assumption holds, the value will not be
impacted by outliers from bird calls, to which the mean would be sensitive. This estimate is then subtracted from
the original image; an example is shown in Figure 5.</p>
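<p>The per-frequency noise subtraction can be written directly in NumPy; this is a sketch of the technique as described, not the competition code.</p>

```python
import numpy as np

def remove_stationary_noise(spec, q=0.25):
    """Estimate the per-frequency (per-row) background level with the 0.25
    quantile over time, then subtract it, clipping at zero. Constant
    horizontal stripes vanish; brief bird calls barely move the quantile."""
    noise = np.quantile(spec, q, axis=1, keepdims=True)
    return np.clip(spec - noise, 0.0, None)

spec = np.array([[0.2, 0.2, 0.2, 0.2],   # insect drone: constant stripe
                 [0.0, 0.9, 0.0, 0.0]])  # brief bird call
clean = remove_stationary_noise(spec)
```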
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Domain distance</title>
          <p>To quantify the extent to which the two domains were becoming more similar, we used a modified Fréchet
Inception Distance [25]. FID compares the distributions of last-hidden-layer activations of an
Inception-v3 network between two datasets; we use the activations of our best submission
model instead. Because the goal was not to remove the discrepancy between call and no-call, that
separation needed to be preserved, so only regions III and IV were used. We hypothesised that this could be
used to estimate the impact of a shift mitigation technique before training a model and evaluating it
with test labels.</p>
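<p>The Fréchet distance between two activation sets compares their Gaussian statistics. A NumPy-only sketch (using the symmetric-PSD square-root identity, so that scipy.linalg.sqrtm is not needed; the activation arrays are random placeholders):</p>

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

def frechet_distance(acts1, acts2):
    """FID-style distance between two activation sets of shape (n, d)."""
    mu1, mu2 = acts1.mean(axis=0), acts2.mean(axis=0)
    c1 = np.cov(acts1, rowvar=False)
    c2 = np.cov(acts2, rowvar=False)
    s1 = _sqrtm_psd(c1)
    # trace of sqrt(s1 @ c2 @ s1) equals trace of sqrt(c1 @ c2)
    covmean = _sqrtm_psd(s1 @ c2 @ s1)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 8))  # placeholder "activations"
```

Identical activation sets give a distance of (numerically) zero; a mean shift of 1 in every dimension adds its squared norm to the distance.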
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Deep Domain Adaptation</title>
          <p>Some methods are developed specifically to mitigate domain shift while training a deep learning
model; examples are DANN [26] and MDD [27]. They take labelled train data and unlabeled test
samples. We did not have success with these techniques, however. In part, this was due to the difficulty
of tuning hyperparameters for the adversarial networks they contain, which are prone to instability.
An argument for why these techniques might not be suitable without modification is that their
objective includes removing as many differences between train and test as possible. This could cause
issues when a major difference is the existence of no-calls only in the test data. Optimising
the objective might then require removing the ability to tell bird calls from silence.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section contains our ablation study in Section 4.1 and seed stability
experiments in Section 4.2. To verify the significance of improvements in the ablation study, we also
investigated the effect of randomness on leaderboard performance. For this, we trained and evaluated
experiments with the same configuration but with various seeds that affect the data split, augmentations,
and model weight initialization.</p>
      <sec id="sec-4-1">
        <title>4.1. Ablation Study</title>
        <p>In our ablation study, we aim to analyse the effect of our augmentations. We performed 6 training
runs of our best submission model with 5-fold CV, adding one augmentation at a time. We
submitted each fold individually, resulting in 30 total submissions. From Figure 6 we observe a very
high score variance within folds and a positive correlation of 0.85 between public and private scores. The
Leaderboard group in the boxplot contains a weighted average of the public and private scores,
calculated as 35% Public Score + 65% Private Score. MixUp1D resulted in the most significant
improvement of ~0.03 on the Kaggle leaderboard.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Seed Stability</title>
        <p>For our seed stability experiment, we used the model pipeline from our best submission and retrained
that model 5 times with different seeds for 5-fold CV each. We made late submissions on Kaggle for
each fold, resulting in a total of 25 variations of the same model. The public and private leaderboard
scores of these models are shown in Figures 7a and 7b.</p>
        <p>(Figure 7: (a) distribution of public &amp; private LB scores; (b) correlation between public &amp; private LB scores: 0.25.)</p>
        <p>There is a significant variance for both the public and private leaderboard scores with the same
model. More specifically, the public leaderboard scores have a standard deviation of 0.01197 and the
private leaderboard scores have a standard deviation of 0.01091. Furthermore, we can observe a slight
correlation of 0.25 between public and private leaderboard scores by only varying the seed of the run
configuration.</p>
        <p>It is worth noting that while various ensembling techniques can be used to stabilize models, this is
not always possible due to the strict CPU inference time limit.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Shift mitigation</title>
        <p>To evaluate the effect of shift mitigation using test-time scaling and the frequency-based filter, 5
submissions were made for each, along with a baseline. For the frequency-based filter, the model was
trained again with the filter applied, while the test scaling was performed at inference time with the
original model weights. Again, these 5 submissions are the results of training on different folds. The
results are shown in Figure 8; the score is calculated as 35% Public Score + 65% Private Score. Both methods
improve on the baseline, audio scaling most significantly.</p>
        <p>For these three techniques, we measured the distance between train and test distributions as described
in Section 3.3.4. This was measured only for data that was in regions III and IV in the original UMAP
projection, with the same model that generated Figure 2. For the baseline, the modified FID was 41.4;
applying the frequency-based filter to all data shrank the distance to 37.6, while rescaling
only the test audio by 1/100 increased it to 46.7.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>Augmentations</title>
        <p>From our ablation study in Section 4.1 we found that MixUp1D was one of the best-performing
augmentations. We suppose this is because the soundscapes contain more simultaneous birds than
our training data, and MixUp increases the models' capability to learn different birds at the same
time. The CutMix augmentations did not result in a significant improvement; after further analysis
we observed that CutMix often cuts out a bird call and replaces it with a silent section of another bird
audio fragment, thereby teaching the model to annotate silence with a bird call.
We suspect the PhaseShift augmentation was set too extreme, adding too much
noise and reducing the models' capacity to learn the visual bird call patterns. The
AmplitudeShift results, by contrast, were inconclusive, and we suggest tuning it to a higher intensity for
more effect.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Seed stability</title>
        <p>An interesting observation was that our models were in general quite unstable. During
experimentation we did not find a reliable way of evaluating our models locally, so we relied on
the public leaderboard. In the experiments in Section 4.2, we observed that the same model
configuration trained with a different seed could significantly change public and private leaderboard
performance, indicating that randomness was heavily involved during our experimentation
phase. During this phase, we implemented novel ideas, and the only way we found to evaluate an idea
was by making a submission. We might therefore have discarded ideas that got an 'unlucky' low public
score due to this randomness, which would have fared better had we analysed the average over multiple
submissions. Furthermore, it is interesting to note that the public and private scores are slightly
correlated when submitting different seeds, which could indicate that optimizing the seed on the public
leaderboard also transfers to a higher private score.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Shift mitigation</title>
        <p>Visualizing and interacting with the datasets helped us understand the differences between train and
test. This guided us towards two techniques that visually removed the shift and also improved scores
on the test set. However, we are not certain about the impact of the no-call segments. While there seemed
to be many false positives, the two-stage approach did not solve this problem. This might be
caused by not optimizing the two-stage model sufficiently, but it is also surprising that of the top 5
solutions for BirdCLEF 2024, none seemed to get consistent improvements from a call/no-call
model, even though it has been used successfully in other editions. Furthermore, it is not clear what
proportion of the score drop is caused by the shift in regions III and IV (Figure 3) versus false positives.
We are not able to confirm this without access to the labels.</p>
        <p>Unfortunately, decreasing the FID distance between domains does not guarantee an improved test
score, or vice versa. While the distance increased for scaling and decreased for the filter, the
score improved for both. The distance change might be explained by the fact that scaling was only
applied to the test data, which introduces a synthetic shift that can be measured but does not negatively
impact the model. The frequency-based noise filter, applied to both datasets instead of only one,
removed shift distance as intended.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented our solution for the BirdCLEF 2024 competition, focusing on the challenge
of domain shift between the training and test datasets. Our approach includes shift mitigation through
data augmentation and preprocessing. We evaluated the stochasticity of the results and performed
experiments with thorough 5-fold validation.</p>
<p>We found that:
• Domain Shift Mitigation: Applying frequency-based noise removal and scaling test samples, as
guided by exploratory data analysis, proved successful. This introduces a novel filter that could
be applicable to other PAM audio classification problems and generally highlights the importance
of investigating the aspects of domain shift.
• Data Augmentation: MixUp1D applied to audio is a particularly effective technique, likely due
to the presence of multiple bird calls per recording in the test soundscapes. Other
augmentations such as CutMix and PhaseShift did not yield improvements; these might need to
be adapted or excluded in similar experiments.
• Seed Stability: Our seed stability experiments revealed substantial variance in public and private
leaderboard scores, emphasizing the impact of randomness in model training. This underscores
the importance of averaging results over multiple seeds to obtain a reliable performance estimate.
It also implies that conclusions based on the competition outcome should be drawn with caution
if scores are close.
• Call/No-Call Classification: Implementing a two-stage pipeline for call/no-call classification did
not result in the expected score improvements. This suggests that our current implementation
may need refinement or that the issue of false positives is more complex than anticipated.</p>
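<p>A minimal sketch of MixUp applied directly to raw waveforms, in the spirit of the MixUp1D augmentation listed above; the array shapes, label layout, and alpha value are illustrative assumptions, not our exact configuration.</p>

```python
import numpy as np

def mixup_1d(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two raw-audio samples and their label vectors.

    x1, x2: 1-D waveform arrays of equal length.
    y1, y2: multi-hot label vectors (one entry per species).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)      # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2   # overlap the two waveforms
    y = lam * y1 + (1.0 - lam) * y2   # soft labels reflect both calls
    return x, y

rng = np.random.default_rng(42)
a, b = rng.normal(size=32000), rng.normal(size=32000)  # two short clips
ya = np.array([1.0, 0.0, 0.0])
yb = np.array([0.0, 1.0, 0.0])
x, y = mixup_1d(a, ya, b, yb, rng=rng)
```

Mixing at the waveform level produces training samples with overlapping calls and soft multi-species labels, which plausibly matches the test soundscapes better than single-call Xeno-Canto clips.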
      <p>Overall, our findings indicate that addressing domain shift is crucial for achieving robust performance
in bird call classification tasks. Our methods provide a foundation for future work, including further
refinement of data augmentation techniques, deeper analysis of domain shift, and more sophisticated
model evaluation strategies.</p>
<p>In conclusion, while we achieved notable improvements, the BirdCLEF 2024 competition highlighted
the ongoing challenges in developing models that generalize well across different acoustic environments.
Our results underscore the need for continuous innovation and experimentation in tackling domain
shift and enhancing model robustness.</p>
      <sec id="sec-6-1">
        <title>6.1. Future work</title>
<p>Looking ahead, several avenues for future research and experimentation emerge from our findings.
First of all, during our experimentation phase, we might have discarded ideas because they were based
on a single submission. Given the observed impact of randomness on the scores, several ideas that
were initially discarded are worth investigating again, including:
• Alternating Ensemble: For every 4-minute soundscape, corresponding to 48 windows, let
different models predict the windows alternately, followed by averaging neighbouring window
predictions. In this way, we can ensemble without increasing inference time.
• Pretraining on Previous Years: Rerun experiments where we append data from previous BirdCLEF
editions, including soundscape PAM data from Zenodo.
• Two-Stage Refinement: Refine the two-stage pipeline and incorporate more sophisticated
methods for distinguishing between bird calls and background noise, in addition to the
freefield1010 dataset.
• Longer Window Models: Experimenting with models that process longer audio windows (e.g., 10
seconds) could provide more context and improve classification accuracy.
• Multi-Channel Spectrograms: Investigating the use of multi-channel spectrograms to capture
richer audio information.</p>
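<p>The alternating-ensemble idea above can be sketched as follows. The two models are stubbed out with random predictors, and the class count is an illustrative assumption; only the alternation and neighbour-averaging pattern is the point.</p>

```python
import numpy as np

N_WINDOWS, N_SPECIES = 48, 182  # 48 windows per 4-minute soundscape;
                                # 182 classes assumed for illustration

# Stand-ins for two trained models; each maps window indices to scores.
rng = np.random.default_rng(0)
model_a = lambda idx: rng.random((len(idx), N_SPECIES))
model_b = lambda idx: rng.random((len(idx), N_SPECIES))

# Each model predicts only every other window, so total inference cost
# equals a single model run over all 48 windows.
preds = np.empty((N_WINDOWS, N_SPECIES))
preds[0::2] = model_a(np.arange(0, N_WINDOWS, 2))
preds[1::2] = model_b(np.arange(1, N_WINDOWS, 2))

# Average each window with its neighbours so that every final
# prediction blends the outputs of both models.
padded = np.pad(preds, ((1, 1), (0, 0)), mode="edge")
smoothed = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
```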
<p>Secondly, obtaining and analyzing the labels of the test soundscapes would allow us to validate our
hypotheses about domain shift and the effectiveness of our mitigation techniques. Finally, we encourage
the competition hosts to re-analyze the best solutions to the competition. It could be very insightful
to measure the score when excluding no-calls, or excluding overlapping bird calls, to isolate these effects.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
<p>We would like to thank the organizers of the BirdCLEF 2024 competition and all involved institutions.
We extend our thanks to all the participants of the BirdCLEF 2024 competition who were active in the
Kaggle discussion forums for their ongoing efforts in advancing the field of bioacoustics and biodiversity
monitoring. Your dedication and collaboration are instrumental in driving forward conservation efforts
worldwide.</p>
      <p>[9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in:
Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.
URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[10] R. Wightman, PyTorch image models, https://github.com/rwightman/pytorch-image-models, 2019.
doi:10.5281/zenodo.4414861.
[11] ONNX Runtime developers, ONNX Runtime, https://onnxruntime.ai/, 2021. Version: 1.17.3.
[12] Intel Corporation, OpenVINO, https://docs.openvino.ai/, 2018. Open-source toolkit for optimizing and
deploying deep learning models.
[13] Canonical Ltd., Ubuntu 23.10 (Mantic Minotaur), https://releases.ubuntu.com/23.10/, 2023.
Operating system release.
[14] L. Biewald, Experiment tracking with Weights and Biases, https://www.wandb.com/, 2020. Software
available from wandb.com.
[15] Xeno-canto Foundation, Xeno-canto, https://xeno-canto.org/, 2005.
[16] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in:
International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=r1Ddp1-Rb.
[17] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, CutMix: Regularization strategy to train strong
classifiers with localizable features, CoRR abs/1905.04899 (2019). URL: http://arxiv.org/abs/1905.04899.
arXiv:1905.04899.
[18] P. Shukla, How did binary cross-entropy loss come into existence?, Towards AI (2023).
[19] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
(2017).
[20] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861–874.
[21] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for
dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
[22] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, F. Herrera, A unifying view
on dataset shift in classification, Pattern Recognition 45 (2012) 521–530. URL: https://www.sciencedirect.com/science/article/pii/S0031320311002901.
doi:10.1016/j.patcog.2011.06.019.
[23] M. Lasseck, Bird species recognition using convolutional neural networks with attention on
frequency bands, CEUR Workshop Proceedings (2023). URL: https://www.CEUR-WS.org/vol-3497/paper-175.pdf.
[24] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, M. D. Plumbley, An open dataset for research
on audio field recording archives: freefield1010, arXiv:1309.5275, 2013. doi:10.48550/arXiv.1309.5275.
[25] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale
update rule converge to a local Nash equilibrium, 2018. arXiv:1706.08500.
[26] A. Sicilia, X. Zhao, S. J. Hwang, Domain adversarial neural networks for domain generalization:
When it works and how to improve, 2022. arXiv:2102.03924.
[27] J. Li, E. Chen, Z. Ding, L. Zhu, K. Lu, H. T. Shen, Maximum density divergence for domain
adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2021) 3918–3930.
URL: http://dx.doi.org/10.1109/TPAMI.2020.2991050. doi:10.1109/tpami.2020.2991050.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Extra visualizations</title>
    </sec>
    <sec id="sec-9">
      <title>B. Results</title>
      <p>This section contains additional raw results from our experiments.</p>
      <sec id="sec-9-1">
        <title>B.1. Seed Stability</title>
<p>Public LB scores across submissions with different seeds: 0.635002, 0.610959, 0.630144, 0.620723,
0.598396, 0.626281, 0.605735, 0.613994, 0.611792, 0.634171, 0.629907, 0.590896, 0.627677, 0.631077,
0.627975, 0.650119, 0.626682, 0.633204, 0.636626, 0.629955, 0.644002, 0.659341, 0.665605, 0.675107,
0.683710, 0.657769, 0.659733, 0.648198, 0.659187, 0.66703.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF 2024:
          <article-title>Acoustic identification of under-studied bird species in the western ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <source>Overview of LifeCLEF</source>
          <year>2024</year>
          :
          <article-title>Challenges on species distribution prediction and identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , E. Witting, H. de Heer, C. T. Kopar,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sándor</surname>
          </string-name>
          , Identifying Bird Calls,
          <year>2024</year>
          . URL: https://github.com/TeamEpochGithub/iv-q4-birdclef-2024.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , E. Witting, H. de Heer, C. T. Kopar,
          <string-name>
            <surname>J. van Selm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ebersberger</surname>
          </string-name>
          , G. Dumont,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sándor</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. De Dios Allegue</surname>
          </string-name>
          , Epochalyst,
          <year>2024</year>
          . URL: https://github.com/TeamEpochGithub/epochalyst.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , E. Witting, H. de Heer, C. T. Kopar,
          <string-name>
            <surname>J. van Selm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ebersberger</surname>
          </string-name>
          , G. Dumont,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sándor</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. De Dios Allegue</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://teamepoch.ai/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ronacher</surname>
          </string-name>
          ,
          <source>Rye: a Hassle-Free Python Experience</source>
          ,
          <year>2024</year>
          . URL: https://rye.astral.sh/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Dask</surname>
            <given-names>core developers</given-names>
          </string-name>
          ,
          <source>Dask | Scale the Python tools you love</source>
          ,
          <year>2024</year>
          . URL: https://www.dask.org/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McVicar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Faronbi</surname>
          </string-name>
          , I. Roman,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seyfarth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lostanlen</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Niekirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cwitkowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zalkow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Halvachs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Robert-Stöter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bittner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weiss</surname>
          </string-name>
          , E. Battenberg,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yamamoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Metsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Friesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hidaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kowalik</surname>
          </string-name>
          , F. Keller, D. Mazur,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chabot-Leclerc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawthorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ramaprasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Keum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Monroe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Morozov</surname>
          </string-name>
          , K. Eliasi, nullmightybofo, P. Biberstein,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Sergin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hennequin</surname>
          </string-name>
          , R. Naktinis, beantowel, T. Kim,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Åsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hereñú</surname>
          </string-name>
          , S. van der Struijk, L. Nickel,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vollrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarrof</surname>
          </string-name>
          ,
          <string-name>
            <surname>Xiao-Ming</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kranzler</surname>
            , Voodoohop,
            <given-names>M. D.</given-names>
          </string-name>
          <string-name>
            <surname>Gangi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Jinoz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Guerrero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mazhar</surname>
            , toddrme2178,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Baratz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kostin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>C. T.</given-names>
          </string-name>
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Campr</surname>
            , E. Semeniuc,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Biswal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Moura</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brossier</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          , W. Pimenta, librosa/librosa: 0.10.2.post1,
          <year>2024</year>
          . URL: https://doi.org/10.5281/zenodo.11192913. doi:10.5281/zenodo.11192913.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>