<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning with Semi-Supervised Dataset Annotation for Birdcall Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nathan Zhong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Murilo Gustineli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Hayduk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present working notes on transfer learning with semi-supervised dataset annotation for the BirdCLEF 2023 competition, focused on identifying African bird species in recorded soundscapes. Our approach utilizes existing off-the-shelf models, BirdNET and MixIT, to address representation and labeling challenges in the competition. We explore the embedding space learned by BirdNET and propose a process to derive an annotated dataset for supervised learning. Our experiments involve various models and feature engineering approaches to maximize performance on the competition leaderboard. The results demonstrate the effectiveness of our approach in classifying bird species and highlight the potential of transfer learning and semi-supervised dataset annotation in similar tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer Learning</kwd>
        <kwd>Dataset Annotation</kwd>
        <kwd>BirdNET</kwd>
        <kwd>Bird-MixIT</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Embedding Space and Transfer Learning</title>
      <p>
        BirdNET [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] classifies 48 kHz 3-second audio clips into 3337 classes using a convolutional neural
network trained on scaled spectrograms computed by the short-time Fourier transform. The
classes are primarily composed of bird species but include non-bird classes such as environmental
noise and human voices. We obtain embedding tokens by taking the values at the second-to-last
layer of the model, preceding the fully connected logit layer. The embedding maps audio in the
time domain into a vector space ℛ<sup>320</sup> that roughly preserves distances between points. We use
embedding tokens as features in a supervised machine-learning model to take advantage of the
compact representation of the audio data. We show a clear separation between the embedding
tokens by running them through a dimensionality reduction technique in figure 1. The clustering
supports the hypothesis that we can effectively utilize the representation learned by BirdNET in new
contexts.
      </p>
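      <p>As a concrete illustration, the sketch below loads a Keras export of BirdNET and reads activations at the second-to-last layer; the model path and layer indexing are hypothetical placeholders rather than the exact interface we use.</p>
      <preformat>import numpy as np
import tensorflow as tf

# a minimal sketch, assuming a Keras-compatible export of BirdNET;
# "birdnet-analyzer-pruned" and the layer index are illustrative
model = tf.keras.models.load_model("birdnet-analyzer-pruned")
embedding_model = tf.keras.Model(
    inputs=model.inputs,
    outputs=model.layers[-2].output,  # layer preceding the logit layer
)

sr = 48_000
clip = np.random.randn(1, sr * 3).astype(np.float32)  # one 3-second clip
token = embedding_model.predict(clip)  # expected shape (1, 320)</preformat>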
    </sec>
    <sec id="sec-3">
      <title>3. Semi-Supervised Dataset Annotation</title>
      <p>
        We propose a process to derive an annotated dataset that fits data to traditional supervised learning
algorithms. First, we chunk the audio within the training examples so that no track lasts longer than 3 minutes,
recursively splitting the tracks until they are smaller than our threshold and padding them to the
nearest 3-second boundary with additive white noise. Chunking the audio solves the problem of
batch-processing skew introduced by several examples that are longer than 30 minutes. We assume
the upper bound on the track length is sufficient to model temporal dependencies. We use
Bird-MixIT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to isolate environmental noise and bird vocalizations. We then process each of
the tracks using BirdNET to extract an embedding vector in ℛ<sup>320</sup> and a prediction logit vector
in ℛ<sup>3337</sup> for every 3-second interval over a 1-second sliding window. Each interval is labeled
with the top prediction labels and the energy of the original track. We save the results from each
track to disk and consolidate them into a Parquet dataset. See table 1 for an example row of this
process. We only have to pay for the expensive process of running TensorFlow models once by
processing the audio training examples before fitting models.
      </p>
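      <p>A minimal sketch of the chunking step follows, assuming a mono waveform already loaded into a numpy array; the halving strategy and noise scale are illustrative assumptions.</p>
      <preformat>import numpy as np

SR = 48_000
MAX_SECONDS = 180  # the 3-minute threshold described above


def pad_to_3s(y, noise_scale=1e-4, sr=SR):
    """Pad a clip to the next 3-second boundary with additive white noise."""
    remainder = (-len(y)) % (3 * sr)
    noise = np.random.randn(remainder).astype(y.dtype) * noise_scale
    return np.concatenate([y, noise])


def chunk(y, sr=SR, max_seconds=MAX_SECONDS):
    """Recursively halve a track until each piece is under the threshold."""
    if len(y) > max_seconds * sr:
        mid = len(y) // 2
        return chunk(y[:mid], sr, max_seconds) + chunk(y[mid:], sr, max_seconds)
    return [pad_to_3s(y, sr=sr)]</preformat>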
      <p>We had four versions of the embedding dataset (emb). In emb_v1, we encountered missing
entries and Docker permission problems. We fixed these in emb_v2 but mapped the input
to the wrong model layer. Our first usable dataset was emb_v3, which fixed the previous issues and
addressed the skew of long tracks by recursively chunking them. emb_v4 includes all possible
3-second intervals at a 1-second resolution. With the embedding and prediction vectors in a
dataset, we apply a series of heuristics and feature engineering for our final models.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Implementation and Workflow</title>
      <p>We split our workflow into training and inference. Our training pipeline runs on the Google
Cloud Platform (GCP), while inference runs in a Kaggle notebook optimized for offline usage.</p>
      <p>
        We implement a shared Python package on GitHub (github.com/dsgt-birdclef/birdclef-2023). The package defines environment
dependencies to run the training and inference workflows. It contains helper PySpark code
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], wrappers around BirdNET, and utilities for manipulating audio samples into matrices
representing sliding windows. We also define a workflow package containing Luigi [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] scripts
that implement dataset processing and annotation. We build Docker images for BirdNET and
MixIT and integrate modified versions of the canonical inference scripts into our data pipeline
as per figure 3.
      </p>
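      <p>As a sketch of how a task in the workflow package might look, the hypothetical Luigi task below wraps the per-track annotation step; the task, parameters, and output path are placeholders rather than our actual pipeline code.</p>
      <preformat>import json

import luigi


class AnnotateTrack(luigi.Task):
    """Hypothetical task: extract BirdNET embeddings/logits for one track."""

    track_path = luigi.Parameter()
    output_path = luigi.Parameter()

    def output(self):
        # Luigi skips completed targets, so the expensive TensorFlow
        # pass over each track only ever runs once
        return luigi.LocalTarget(self.output_path)

    def run(self):
        rows = []  # placeholder: one row per 3-second interval
        with self.output().open("w") as f:
            json.dump(rows, f)


if __name__ == "__main__":
    luigi.build(
        [AnnotateTrack(track_path="track.ogg", output_path="track.json")],
        local_scheduler=True,
    )</preformat>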
      <p>The inference pipeline is composed of three notebooks. The package-sync notebook
downloads the shared Python package with all dependencies into a local directory. The model-sync
notebook similarly downloads serialized models and weights from object storage. The final
inference notebook runs offline after attaching the package-sync and model-sync notebooks as data
sources. We read the soundscapes, split them into chunks, obtain BirdNET embeddings, and
compute the predictions submitted to the competition. See the appendix for the source code.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>We run experiments on the embedding dataset to maximize our performance on the public
leaderboard. The feature engineering process of embedding tokens and prediction logits is
part of the model-fitting process. We first focused on a baseline with minimal modifications to
the embedding dataset and then worked toward overcoming the idiosyncrasies of the training
dataset with more complex feature engineering. The input interval of BirdNET and the output
interval of the competition do not match, so we had to account for this difference. The former
expects 3-second intervals, while the latter expects 5-second intervals. Our two main approaches
are to aggregate the output of models that represent 3-second intervals and to aggregate input
to represent 5-second intervals.</p>
      <p>While we use simple measures such as macro-precision and accuracy to assess models against
the derived dataset in hyperparameter searches, we note that they did not reflect performance
against the public leaderboard. The results of our derived datasets and models are summarized
in table 3 and table 4, respectively.</p>
      <sec id="sec-5-1">
        <title>Description</title>
        <p>Simplified dataset where embedding tokens come from the channel
with the highest number of positive classifications against the
baseline model. Multi-label generated by averaging pairs and triplets of
embedding tokens together.</p>
        <p>Tokens are now the average of the first and third tokens of each
5second interval. We generate current, next, and track embeddings
using only source-separated tracks. Tokens are multi-labeled by the
primary and secondary species associated with the track.
Augments above but the top-20 tokens in each species are averaged
against random no-call tokens to simplify train-test splits.
Same methodology as post v2 and v3. Multi-label is generated by
confident baseline predictions filtered by plausible primary and
secondary metadata labels.</p>
        <p>Same as post v4, but it fixes a modulo bug in previous averaged-token
datasets.</p>
        <p>Drops notion of multi-label prediction. It uses the original track and
the best source-separated track to increase the number of training
examples.</p>
        <p>Adds logic to assign the primary label and no-call labels. It uses
concatenation instead of interpolation for a 5-second interval token and
includes the prediction logits for the current interval.
emb v3
emb v3
post v1
post v1
post v5
post v5
post v3
post v3
post v7
post v4
post v7</p>
        <sec id="sec-5-1-1">
          <title>Baseline</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Baseline</title>
          <p>Multi-label one-vs-rest strategy.
Interpolated token pairs and triplets.
Same as above, but weighted samples
and native multi-label training.
Current interpolated-token.</p>
          <p>Current and track interpolated-token.
Current, next, and track
interpolatedtoken.</p>
          <p>Current and next interpolated-token.
Current concatenated token.</p>
          <p>Ensemble of best logistic regression
and boost model.</p>
          <p>Current prediction logit softmax
vector.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Public</title>
        <p>Score
0.78541
0.74014
0.79068
0.78829
0.7692
0.7489
0.75049
0.76484
0.75997
0.75091
0.71093</p>
      </sec>
      <sec id="sec-5-3">
        <title>Private</title>
        <p>
          Score
0.68369
0.62283
0.68181
0.68053
0.65937
0.63059
0.63877
0.65346
0.64414
0.64242
0.59652
5.1. Baseline Model
The baseline model uses embedding tokens taken from the isolated track source with the highest
energy, assuming that the loudest voice in the track is associated with the primary label. We
label the tokens according to the primary label if the max probability of the prediction vector
exceeds a threshold, e.g., 0.5; otherwise, we label the token as "no-call". We fit the data to logistic
regression, multi-layer perceptron (MLP), support vector machine (SVM), and gradient-boosted
decision tree (GBDT) classifiers. We use scikit-learn [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for the first three of these models
and the scikit-learn-compatible interface to XGBoost [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for the latter. When applicable,
we perform a hyperparameter search using scikit-optimize [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which performs sequential
optimization using Bayesian methods.
        </p>
        <p>We hypothesize that classes cluster together in the low-dimensional embedding learned by
BirdNET and that linear models can effectively learn to discriminate between new classes. Our
baseline logistic regression classifier trained on the embedding tokens reaches a public/private
score of 0.78/0.68, notably better than the starter Kaggle notebook using the Google Research
Bird Vocalization Classifier with a score of 0.72/0.61. We find that SVM and GBDT are
comparable to the logistic regression and that MLP models require significant tuning to reach good
performance.</p>
        <p>GBDT via XGBoost is our preferred model because it trains quickly on a GPU while maintaining relatively
high predictive performance. While logistic regression has fewer parameters and performs just
as well, training can be slow as the number of examples increases. In table 5, logistic regression
takes 12x as long to train as XGBoost with GPU-based histogram binning.</p>
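        <p>A condensed sketch of the baseline fit is shown below, assuming the consolidated Parquet dataset exposes an embedding column, the max BirdNET probability, and the primary label; the path and column names are illustrative assumptions.</p>
        <preformat>import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

df = pd.read_parquet("emb_v3.parquet")  # hypothetical path and schema
X = np.stack(df["embedding"].to_numpy())
# keep the primary species when the prediction is confident, else "no-call"
y = np.where(df["max_prob"] > 0.5, df["primary_label"], "no-call")

codes, classes = pd.factorize(y)  # XGBoost expects integer-encoded labels
gbdt = XGBClassifier(tree_method="gpu_hist").fit(X, codes)

# logistic regression is comparable in score but slower to fit at scale
logreg = LogisticRegression(max_iter=1000).fit(X, y)</preformat>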
      </sec>
      <sec id="sec-5-4">
        <title>5.1.1. Baseline Binary No-call Model</title>
        <p>We explore and analyze the performance of a binary classifier to further our understanding
of the embedding space. While we do not use this model directly in the competition, it helps
quantify the quality of our automated labeling process.</p>
        <p>We construct a binary dataset with positive and negative embedding samples. Positive
signifies the presence of a birdcall in the sample, while negative denotes its absence. The
distribution of positive and negative samples within the dataset is well-balanced, with 49.7%
being positive and 50.3% being negative. About half of the available training data is empty
across original and sound-separated tracks. A logistic regression classifier achieved an accuracy
of 0.88 on this binary dataset.</p>
        <p>We create a second binary dataset from a subset of the first using the top three most common
species, with 49.8% of samples being positive and 50.2% being negative. A logistic regression
classifier trained on this smaller dataset reaches an accuracy of 1.0. We hypothesize that this
behavior is due to the large number of species in the full dataset. The distribution of some species
is highly skewed, as shown in Figure 4.</p>
        <p>Furthermore, we classified audio embeddings of the freefield1010 background soundscape
dataset [11] using the logistic regression classifier. The classifier predictions are probability
scores indicating the presence or absence of birdcalls in each sample. It predicted 77.3% of the
samples as having no birdcalls (no-call) and 22.7% as having birdcalls (call). However, if the
classifier's predictions were perfect, the no-call percentage would have been 100%, as the entire
dataset consists only of background noise.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.2. Interpolated Embedding Models</title>
        <p>
We build upon the baseline embedding model by interpolating embedding tokens to synthesize
new examples and labels. We assume that positive examples from each class tend to cluster
together in the high-dimensional embedding space. We also assume that the embedding space
takes on a Euclidean geometry or admits some approximation. By interpolating examples, we
hypothesize that the resulting coordinate lies between the clusters and, therefore, closer to the
decision boundary of the sources.</p>
        <p>We first use interpolation to address the alignment problem between the 3-second interval of
BirdNET and the 5-second interval of the competition by generating a new feature. In addition,
we construct a method that generates pairs and triplets of tokens sampled evenly across
classes. These interpolated tokens are assigned multiple labels used in a multi-label classifier.</p>
        <p>We experiment with interpolation to add contextual information to each example. For each
token x<sub>i</sub> in a track from our dataset, we generate the token that directly follows it in time, x<sub>i+1</sub>,
and the token that represents the entire track, ∑<sub>j=0</sub> x<sub>j</sub>. We generate features by concatenating
(⊕) each of these tokens and evaluate performance relative to each other using the same set of
labels.</p>
        <p>ŷ ∼ f<sub>1</sub>(x<sub>i</sub>) (1)</p>
        <p>ŷ ∼ f<sub>2</sub>(x<sub>i</sub> ⊕ x<sub>i+1</sub>) (2)</p>
        <p>ŷ ∼ f<sub>3</sub>(x<sub>i</sub> ⊕ ∑<sub>j=0</sub> x<sub>j</sub>) (3)</p>
        <p>ŷ ∼ f<sub>4</sub>(x<sub>i</sub> ⊕ x<sub>i+1</sub> ⊕ ∑<sub>j=0</sub> x<sub>j</sub>) (4)</p>
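        <p>A small numpy sketch of assembling the inputs for equations (2) through (4) from a track's token sequence; it mirrors the prepare_embedding helper in the appendix but uses a mean in place of the track sum, which differs only by a constant factor.</p>
        <preformat>import numpy as np

tokens = np.random.randn(120, 320).astype(np.float32)  # one token per interval

x_next = np.roll(tokens, shift=-1, axis=0)  # the token at i+1
x_next[-1] = tokens[-1]                     # clamp the final interval
x_track = np.repeat(tokens.mean(axis=0, keepdims=True), len(tokens), axis=0)

f2_input = np.concatenate([tokens, x_next], axis=1)           # x_i ⊕ x_{i+1}
f3_input = np.concatenate([tokens, x_track], axis=1)          # x_i ⊕ track
f4_input = np.concatenate([tokens, x_next, x_track], axis=1)  # all three</preformat>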
        <p>Interpolation may provide some value, particularly for multi-labeling. We found that
our model trained on an interpolated dataset does not perform worse than our baseline model while avoiding
issues related to having a small number of training examples per class. Most models trained
using a form of interpolated tokens resulted in lower leaderboard scores; however, these models
did not include interpolated pair and triplet tokens.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.3. Concatenated Embedding Model</title>
        <p>Another class of models that handles the time-interval discrepancy involves concatenating the
embedding tokens. We must train a classifier that accepts input in ℛ<sup>2·320</sup>. This model performs
worse than the interpolated model. The increased dimensionality of the underlying feature also
increases the model fit time, which leads us to skip augmented feature sets.</p>
      </sec>
      <sec id="sec-5-7">
        <title>5.4. Ensemble Embedding Model</title>
        <p>Our last embedding model uses the best models from our baseline and interpolated embedding
experiments. We train an XGBoost model on the outputs of the best classifiers. This
model is likely affected by the quality of the training dataset and the differences in embedding
token semantics.</p>
      </sec>
      <sec id="sec-5-8">
        <title>5.5. Probability Logit Model</title>
        <p>Our final experiment uses the outputs of the final logit layer in the BirdNET model to determine
a class's presence directly. We generate a probability vector by taking a softmax over the logit layer.
We ran into out-of-memory issues on our GPUs when fitting XGBoost models on the probability
vector in ℛ<sup>3337</sup> and found the scikit-learn logistic regression implementation too slow.
We instead fit the data using a Complement Naive Bayes classifier. The Naive Bayes assumption works well in this problem, where each feature
contributes independently toward the classification goal. It is also fast because it simply computes
counts over features. We use the Complement Naive Bayes model to address the heavy skew in
the class distribution but find comparable performance across this family of classifiers. The
predictive performance is far worse than the baseline, with a public/private score of 0.71/0.59.</p>
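        <p>A minimal sketch of this experiment, assuming the prediction logits have already been extracted into a matrix; scipy's softmax and scikit-learn's ComplementNB are the real APIs, while the data here is randomly generated for illustration.</p>
        <preformat>import numpy as np
from scipy.special import softmax
from sklearn.naive_bayes import ComplementNB

logits = np.random.randn(10_000, 3337)      # per-interval BirdNET logits
labels = np.random.randint(0, 264, 10_000)  # hypothetical species labels

probs = softmax(logits, axis=1)  # non-negative features suit count-based NB
clf = ComplementNB().fit(probs, labels)
print(clf.predict(probs[:5]))</preformat>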
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>6.1. Semi-Supervised Annotation Quality</title>
        <p>Data quality is an aspect of our training dataset that we would like to explore more deeply
because it affects the quality of the model. We have built a training workflow that
enables flexible labeling of timestamps across the entire training audio at 1-second granularity.
While we succeeded in our baseline transfer learning experiment, having a human-labeled test
dataset independent of the competition leaderboard would be helpful. We can use ground-truth
annotations to test how well an automated labeling process does at assigning primary, secondary,
and no-call labels.</p>
      <p>It would also be worth exploring human annotation in the sound separation process. While
we have listened to several examples to judge the quality of sound separation, we need a method
to quantify quality. In the case of sorting by the highest energy source, we may confuse a source
with a significant amount of noise for the true birdcall just by the nature of higher entropy in
the source. In the case of sorting by the highest number of matching classifications from an
existing model, we may need to produce a classifier that can distinguish between noise and
birdcall for rare classes. In this case, we continue to propagate uncertainty into the resulting
labels, leading to poor performance, mainly when the number of examples is small.</p>
      <p>We could introduce a metric to rigorously measure the resulting separation quality. We could
create many positive birdcall clips and have a human determine which channel mainly contains
the primary species. These clips would let us see how well our automated channel selection process
does at choosing the right channel based on human-generated labels. However, this metric
would not allow us to determine the separation quality in degenerate cases where a single
separated source contains multiple bird vocalizations.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Audio Source Separation</title>
        <p>The MixIT source separation model plays a significant role in the embedding pipeline used for
our experiments. One of its most significant benefits is noise suppression. We also relied on
sound-separated channels to generate training examples and semi-supervised labels. Running
an ablation study that evaluates the labeling process without access to the sound separation
model would have been helpful.</p>
      <p>We wanted to explore the 8-source model, which could have different performance
characteristics than the 4-source model. However, we decided against it because the
training examples generally have few distinct vocalizations, and creating more source channels
increases the disk space required for the intermediate files.</p>
      <p>We are also interested in the effect of the sound separation model during inference. However,
we observed that the sound separation stage in training consumed a significant fraction of the
computation budget. There are also issues with differing sample rates: BirdNET expects audio
sampled at 48 kHz, while MixIT expects audio at 32 kHz. We did not attempt to integrate the
model into the inference pipeline because it would have put us over the submission time limit and required
significant engineering effort to run inside the competition environment.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Embedding Space and Transfer Learning</title>
        <p>
Our experiments with interpolated embeddings in section 5.2 had mixed results with respect
to the baseline embedding model. Including embedding context from other time intervals
had substantially lower performance than our baseline. The new labeling process may have
overshadowed potential positive efects. On the other hand, we saw a slight increase in our
model performance when using the mean of pairs and triplets during model fitting. Further
research is necessary to determine whether it is valuable to manipulate embeddings similarly
to mixup [12], which interpolates between training examples in the time domain to increase and
augment the training data.</p>
      <p>We want to experiment with the embeddings from another model, such as the Google Research
Bird Vocalization Classifier. This classifier has seen more of the species in this competition than
BirdNET. It would be helpful to see how these two embeddings compare, and we could try this
out as another feature.</p>
      <p>Sequential models could be helpful in the competition by capturing dynamics and imbuing
contextual information between embedding tokens. The simplest model would be an
autoregressive linear model using an embedding from a single timestep to predict the next timestep,
optimized by a squared-error loss. We made initial forays into attention-based
sequence-to-sequence models to address the output time-interval issue but needed more time to complete
our experimentation. Future work might explore data-driven methods like HAVOK [13] to
analyze the dynamics of birdcall audio and their embeddings.</p>
      <p>We would also like to explore the relationship between the sound-separated tracks and the
embedding. The separation model is constrained so that the sum of the sources results in the
original track. It would be interesting to verify a relationship between embeddings of various
tracks by fitting a predictive model that takes tokens from each source to predict the embedding
of the original track. This line of thought does not directly help with model performance on the
final task, but it does help us understand the nature of the classifier embedding space.</p>
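      <p>As a sketch of this proposed experiment, one could fit a linear map from the concatenated source-channel tokens to the original-track token; the shapes and the use of ridge regression are illustrative assumptions.</p>
      <preformat>import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

n = 5_000                              # hypothetical number of intervals
sources = np.random.randn(n, 4, 320)   # tokens from the four MixIT channels
original = np.random.randn(n, 320)     # token of the unseparated track

X = sources.reshape(n, -1)             # concatenate the source tokens
X_tr, X_te, y_tr, y_te = train_test_split(X, original, test_size=0.2)
model = Ridge().fit(X_tr, y_tr)
print("R^2 of the source-to-original map:", model.score(X_te, y_te))</preformat>
      </sec>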
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In summary, our approach leveraged the embedding space learned by BirdNET to address
the representation and labeling challenges in the competition. We developed a pipeline that
included sound separation with MixIT, extraction of embedding tokens using BirdNET, and
the creation of annotated datasets. Our results showcased the competitive performance of our
logistic regression baseline model as an effective method on unseen species and the comparative
performance of various feature engineering approaches. Our approach demonstrates the potential of
transfer and semi-supervised learning for bird species classification in soundscapes.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Thanks to the Data Science at Georgia Tech (DS@GT) club for hosting our Kaggle competition
team. Thanks to DS@GT leadership for publicizing recruitment, particularly Krishi Manek as
the Director of Projects. Thanks to Erin Middlemas and Grant Williams for their support and
engagement as initial team members.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Kaggle Inference Sync Notebooks</title>
      <sec id="sec-9-1">
        <title>A.1. birdnet-transfer-learning-package-sync</title>
        <preformat># %% [code]
# Clone the repo and all of the submodules; this way, we can include the extras
! if [[ -d birdclef-2023 ]]; then rm -rf birdclef-2023; fi
! git clone --recurse-submodules https://github.com/dsgt-birdclef/birdclef-2023.git
! pip download -d pip-packages --prefer-binary ./birdclef-2023 tensorflow==2.11.0</preformat>
      </sec>
      <sec id="sec-9-2">
        <title>A.2. birdnet-transfer-learning-model-sync</title>
        <preformat>from google.cloud import storage
from pathlib import Path


def download(client, bucket, prefix, relative):
    for blob in client.list_blobs(bucket, prefix=prefix):
        name = Path(blob.name).relative_to(relative)
        name.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(name)
        print(f"downloaded {name}")


client = storage.Client(project="birdclef-2023")
download(client, "birdclef-2023", "data/models/birdnet-analyzer-pruned", "data")
download(client, "birdclef-2023", "data/raw/sound_separation", "data")
download(client, "birdclef-2023", "data/models/baseline", "data")
download(client, "birdclef-2023", "data/models/ensembles", "data")</preformat>
      </sec>
      <sec id="sec-9-3">
        <title>A.3. birdnet-transfer-learning-inference</title>
        <preformat># %% [code]
# move the package into a temp directory so it's editable
! mkdir -p /kaggle/temp/birdclef-2023
! rsync -r \
    /kaggle/input/birdnet-transfer-learning-package-sync/birdclef-2023/ \
    /kaggle/temp/birdclef-2023/
! pip install \
    --no-index \
    --find-links=/kaggle/input/birdnet-transfer-learning-package-sync/pip-packages \
    /kaggle/temp/birdclef-2023

# %% [markdown]
# ## main submission code

# %% [code]
import pickle
import time
from functools import partial
from pathlib import Path

import librosa
import numpy as np
import tensorflow as tf
import tqdm
from pyspark.sql import functions as F

from birdclef import birdnet
from birdclef.utils import get_spark
from birdclef.data.utils import slice_seconds

repo_path = Path(
    "/kaggle/input/birdnet-transfer-learning-model-sync"
    "/models/birdnet-analyzer-pruned"
)
birdnet_model = birdnet.load_model_from_repo(repo_path)
embedding_func = birdnet.embedding_func(birdnet_model)
prediction_func = birdnet.prediction_func(birdnet_model)

model_prefix = "/kaggle/input/birdnet-transfer-learning-model-sync/models"
model_name = "acm-model-concat-v2"
model_path = Path(f"{model_prefix}/baseline_v2/{model_name}.pkl")
clf = pickle.loads(model_path.read_bytes())

# re-encode the classes properly for the inference script on xgboost
encoder_name = f"{model_name}_mlb"
le_path = Path(f"{model_prefix}/baseline_v2/{encoder_name}.pkl")
le = pickle.loads(le_path.read_bytes())
clf.classes_ = le.classes_

# %% [code]
def prepare_embedding(X, use_next=True, use_global=True):
    # X has shape (120*2, sr*3)
    # the current tokens
    X_current = X.reshape(-1, 2, X.shape[-1]).mean(axis=1)
    X_res = X_current
    # the next token
    if use_next:
        X_next = np.roll(X_current, shift=1, axis=0)
        X_next[0] = X_current[0]
        X_res = np.concatenate([X_res, X_next], axis=1)
    # the global or track tokens; we average over every 2-minute interval
    if use_global:
        X_global = X_current.reshape(-1, 12*2, X_current.shape[-1]).mean(axis=1)
        X_global = np.repeat(X_global, 12*2, axis=0)
        X_res = np.concatenate([X_res, X_global], axis=1)
    return X_res


def prepare_embedding_concat(X, **kwargs):
    return X.reshape(-1, X.shape[1]*2)


def run_inference(path, embedding_func, prediction_func, clf, sr=48000, **kwargs):
    y, sr = librosa.load(path.as_posix(), sr=sr, mono=True)
    X = slice_seconds(y, sr, seconds=3, step=1)
    # drop every 4th/5th index, so we're not processing more than we need to;
    # first pad the resulting slices by 2
    X = np.pad(X, ((0, 2), (0, 0)))
    # then reshape it
    X = X.reshape(-1, 5, X.shape[-1])
    # now drop the last 2 seconds of each 5-second frame
    X = X[:, [0, 2], :].reshape(-1, X.shape[-1])
    assert X.shape == (120*2, sr*3), X.shape
    emb = embedding_func(X)[0]
    prob = clf.predict_proba(prepare_embedding_concat(emb, **kwargs))
    assert prob.shape == (120, len(clf.classes_)), (prob.shape, len(clf.classes_))
    rows = []
    for ts, probs in zip(range(0, 600, 5), prob):
        row = dict(
            row_id=f"{path.stem}_{ts+5}",
            **dict(zip(clf.classes_, np.around(probs, 6).tolist())),
        )
        rows.append(row)
    return rows


test_path = Path("/kaggle/input/birdclef-2023/test_soundscapes")
rows = []
timings = []
for path in tqdm.tqdm(test_path.glob("*.ogg")):
    start = time.time()
    rows += run_inference(
        path, embedding_func, prediction_func, clf, use_next=True, use_global=True
    )
    timings.append(time.time() - start)

avg_time_sec = np.mean(timings)
est_time_min = avg_time_sec*200/60
print(
    f"took {round(avg_time_sec, 2)} seconds per loop, "
    f"estimated {round(est_time_min, 2)} minutes"
)

# %% [code]
# normalize the output schema with the sample
spark = get_spark()
sample_submission_df = spark.read.csv(
    "/kaggle/input/birdclef-2023/sample_submission.csv",
    header=True,
    inferSchema=True,
)
rows_df = spark.createDataFrame(rows)
submission_df = rows_df.select(sample_submission_df.columns).toPandas()
submission_df.to_csv("submission.csv", header=True, index=False)</preformat>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Reers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cherutich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF 2023:
          <article-title>Automated bird species identification in Eastern Africa</article-title>
          ,
          <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chamidullin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Eggel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          , Overview of LifeCLEF 2023:
          <article-title>evaluation of AI models for the identification and prediction of birds, plants, snakes and fungi</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eibl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <article-title>Birdnet: A deep learning solution for avian diversity monitoring</article-title>
          ,
          <source>Ecological Informatics</source>
          <volume>61</volume>
          (
          <year>2021</year>
          )
          <fpage>101236</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          , Umap:
          <article-title>Uniform manifold approximation and projection for dimension reduction</article-title>
          ,
          <year>2020</year>
          . arXiv:1802.03426.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wisdom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>Improving bird classification with unsupervised sound separation</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>636</fpage>
          -
          <lpage>640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaftan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          , et al.,
          <article-title>Spark sql: Relational data processing in spark</article-title>
          ,
          <source>in: Proceedings of the 2015 ACM SIGMOD international conference on management of data</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1383</fpage>
          -
          <lpage>1394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>Luigi 2.8.13 documentation</source>
          , https://luigi.readthedocs.io/en/stable/,
          <year>2023</year>
          . Accessed: 2023-06-07.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , E. Duchesnay,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          , C. Guestrin,
          <article-title>XGBoost: A scalable tree boosting system</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16</source>
          , ACM, New York, NY, USA,
          <year>2016</year>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          . URL: http://doi.acm.org/10.1145/2939672.2939785. doi:10.1145/2939672.2939785.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] T. Head, MechCoder, G. Louppe, I. Shcherbatyi, fcharras, Z. Vinícius, cmmalone, C. Schröder, nel215, N. Campos, T. Young, S. Cereda, T. Fan, rene-rex, K. K. Shi, J. Schwabedal, carlosdanielcsantos, Hvass-Labs, M. Pak, SoManyUsernamesTaken, F. Callaway, L. Estève, L. Besson, M. Cherti, K. Pfannschmidt, F. Linzberger, C. Cauet, A. Gut, A. Mueller, A. Fabisch, scikit-optimize/scikit-optimize: v0.5.2, 2018. URL: https://doi.org/10.5281/zenodo.1207017. doi:10.5281/zenodo.1207017.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] D. Stowell, M. D. Plumbley, An open dataset for research on audio field recording archives: freefield1010, arXiv preprint arXiv:1309.5275 (2013).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, 2018. arXiv:1710.09412.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] S. L. Brunton, B. W. Brunton, J. L. Proctor, E. Kaiser, J. N. Kutz, Chaos as an intermittently forced linear system, Nature Communications 8 (2017) 19.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>