<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Bird-Species Audio Identification: Ensembling of EfficientNet-B0 and a Pre-trained EfficientNet-B1 Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aaditya Porwal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Dhanbad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this study, I present a novel approach to audio classification for the BirdCLEF 2024 challenge, employing an ensemble of EfficientNet models. My methodology integrates EfficientNet-B0, trained exclusively on the current competition's data, and EfficientNet-B1, pre-trained on datasets from previous BirdCLEF competitions. The EfficientNet-B0 model leverages heavy augmentation techniques to enhance generalization and robustness. Data preprocessing involves transforming audio signals into Mel spectrograms, optimized through feature engineering and augmentation methods. The ensemble strategy, combining predictions from both models, achieves superior performance compared to the individual models. My results demonstrate the efficacy of this approach, with significant improvements in classification accuracy and robustness, exemplified by a 25th-place finish out of 975 competitors on the BirdCLEF 2024 leaderboard.</p>
      </abstract>
      <kwd-group>
<kwd>Deep Learning</kwd>
        <kwd>Bird Species Classification</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Attention Mechanism</kwd>
        <kwd>Sound Detection</kwd>
        <kwd>Audio Source Detection</kwd>
<kwd>EfficientNet</kwd>
        <kwd>Ensembling</kwd>
        <kwd>Audio Classification</kwd>
        <kwd>BirdCLEF</kwd>
        <kwd>Ensemble Learning</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Mel Spectrogram</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Feature Engineering</kwd>
        <kwd>ROC-AUC</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
There are about 10,000 different bird species in the world, and they all play an important role in the
natural world. Birds are excellent indicators of biodiversity change since they are highly mobile and
have diverse habitat requirements. BirdCLEF 2024 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a Kaggle competition organized by The Cornell
Lab of Ornithology in collaboration with LifeCLEF 2024 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], whose challenge is to identify which
birds are calling in long recordings, given training data generated in meaningfully different contexts.
The BirdCLEF 2024 competition focuses on identifying bird calls in long recordings, particularly from
the sky-islands of the Western Ghats. This competition presents significant challenges, including
imbalanced training data per species, domain shifts between training and test data, and a strict two-hour
time limit for analyzing extensive recordings.
      </p>
      <p>This paper is structured to first provide details of the competition and the given data, to ensure a clear
understanding of the challenges posed by the train and test data. I then describe the approach used for this
challenge in detail, covering data preparation, augmentations, model building, training procedures, and
post-processing, before concluding. If successful, this
effort will advance ongoing initiatives to protect avian biodiversity in the Western Ghats, India.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <sec id="sec-2-1">
        <title>2.1. Training Data</title>
        <p>• Train metadata: Along with the audio files, metadata is also provided, consisting of the primary
label, secondary labels, type, latitude, longitude, scientific name, common name, author, filename,
license, rating, and URL.
• Unlabeled soundscapes: Unlabeled audio from the same locations as the test soundscapes is
provided.</p>
        <p>• eBird_Taxonomy_v2021.csv: Contains data on species relationships.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Test Data</title>
        <p>Test_soundscapes: The test_soundscapes directory will be populated with approximately 1,100
recordings to be used for scoring. They are 4 minutes long and in ogg audio format.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. My Approach</title>
      <p>
        My final approach to this dataset was to ensemble two different EfficientNets (B0 and B1) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The B1
model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was conditioned on data from previous years' BirdCLEF competitions during its pretraining
stage, while the B0 model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was trained using only this competition’s data.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. EfficientNet-B0 Using Heavy Augmentation</title>
      <sec id="sec-4-1">
        <title>4.1. Overview</title>
        <p>
          The task of audio classification involves identifying and categorizing audio signals into predefined
classes. In this project, I employed EfficientNet-B0, a highly efficient convolutional neural network
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], to classify audio signals. EfficientNet-B0 was chosen for its balance of performance and
computational efficiency.
        </p>
        <p>The audio data is preprocessed and transformed into Mel spectrograms, which are fed into the
EfficientNet-B0 model. The model is trained using a variety of data augmentation techniques to enhance
generalization and robustness. The classification performance is optimized through advanced feature
engineering, cross-validation, and careful selection of hyperparameters.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Preparation</title>
        <p>
          The audio data used in this project is sampled at a rate of 32,000 samples per second (sr = 32000). Each
audio clip for training has a duration of 30 seconds. To handle the diverse lengths of audio files, a
fixed length of 30 seconds is set for all training samples. From these 30-second clips, random 5-second
segments are extracted for Short-Time Fourier Transform (STFT) processing [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For testing, a uniform
duration of 5 seconds is used. More details are provided in Table 1.
        </p>
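        <p>As a concrete sketch of this cropping step (the function and variable names here are illustrative, not the competition code), each recording can be tiled or truncated to the fixed 30-second length and a random 5-second window drawn from it:</p>
        <preformat>
import numpy as np
import librosa

SR = 32000        # sampling rate stated above
TRAIN_SEC = 30    # fixed training clip length, seconds
CROP_SEC = 5      # segment length fed to the STFT, seconds

def load_random_crop(path, rng=np.random):
    # decode the recording at the target sampling rate
    audio, _ = librosa.load(path, sr=SR, mono=True)
    # tile short files / truncate long ones to exactly 30 s
    audio = np.resize(audio, SR * TRAIN_SEC)
    # draw a random 5-second segment for spectrogram extraction
    start = rng.randint(0, SR * (TRAIN_SEC - CROP_SEC))
    return audio[start:start + SR * CROP_SEC]
        </preformat>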
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Feature Engineering</title>
        <p>
          • Feature engineering focuses on transforming raw audio data (acoustic signal) into a format
suitable for model training. Mel spectrograms are generated from the audio signals, converting
them into a visual representation of frequency content over time. This transformation is achieved
using Short-Time Fourier Transform (STFT) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], where each 5-second audio slice undergoes
Fourier transformation to capture the spectral properties. An illustration of this process can be
seen in Figure 1.
• Additionally, various augmentation techniques are employed to enhance the training data.
        </p>
        <p>
          Spectrogram-specific augmentations like masking and coarse dropout [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] are used, further
diversifying the training data. Mixup augmentation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], both in waveform and spectrogram
forms, combines multiple examples to create synthetic training samples, enhancing the model’s
robustness and generalization capabilities.
        </p>
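        <p>A minimal sketch of the Mel-spectrogram transform above using librosa follows; the n_fft, hop_length, and n_mels values are assumed defaults, not settings reported in this paper:</p>
        <preformat>
import numpy as np
import librosa

def to_mel(segment, sr=32000, n_fft=2048, hop_length=512, n_mels=128):
    # STFT-based Mel spectrogram of one 5-second audio slice
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=n_fft,
        hop_length=hop_length, n_mels=n_mels)
    # log-compress to decibels, as is conventional for CNN inputs
    return librosa.power_to_db(mel, ref=np.max)
        </preformat>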
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Model Building</title>
        <p>The model architecture is based on EfficientNet-B0. It incorporates average pooling (pool_type =
’avg’) and utilizes Binary Cross-Entropy with Focal Loss (loss_type = "BCEFocalLoss") to address class
imbalance. More details are provided in Table 2.</p>
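        <p>A minimal PyTorch sketch of such a BCE-based focal loss is shown below; the alpha and gamma values are common defaults and are assumptions, since the paper does not report them:</p>
        <preformat>
import torch
import torch.nn as nn

class BCEFocalLoss(nn.Module):
    # focal variant of binary cross-entropy: down-weights easy examples
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma
        self.bce = nn.BCEWithLogitsLoss(reduction="none")

    def forward(self, logits, targets):
        bce = self.bce(logits, targets)
        p_t = torch.exp(-bce)  # probability assigned to the true label
        focal = self.alpha * (1.0 - p_t) ** self.gamma * bce
        return focal.mean()
        </preformat>
        <p>With BCEWithLogitsLoss underneath, each species is treated as an independent binary target, which suits clips in which several birds call at once.</p>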
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Augmentation</title>
        <p>Data augmentation plays a crucial role in improving the model’s generalization. Various augmentation
techniques are employed to introduce variability in the training data. These include:
• Waveform Augmentation:
– Random Noise Addition: This technique involves adding random noise to the audio signal
to simulate real-world audio conditions and improve the model’s robustness to background
noise.
– Gain Adjustments: Adjusting the volume levels to handle recordings with different gain
levels. This helps the model generalize better across recordings with varying loudness.
– Pitch Shifting: This technique was considered but ultimately set to 0 because it disrupted
the distinct frequency patterns critical for bird call classification.
– Time Shifting: Similarly, time shifting was considered but set to 0 as it disrupted the
temporal patterns essential for accurate classification.
• Spectrogram Augmentation:
– Masking Parts of the Spectrogram: Techniques like spectrogram masking were employed
to hide parts of the spectrogram, making the model more robust to missing data. However,
this did not show significant improvements, due to the distinct and critical nature of the
bird call frequency patterns.
– Randomly Dropping Coarse Regions: This technique, known as coarse dropout, was
used to randomly drop regions of the spectrogram. It aimed to make the model more robust,
but preliminary experiments showed limited effectiveness.
– Horizontal Flipping: Not used because flipping the time axis of a spectrogram is not
meaningful for audio data and would disrupt the temporal sequence of the sound.
• Mixup Augmentation: Both waveform and spectrogram mixup techniques were employed.</p>
        <p>This involves linearly combining two examples to create a new synthetic example, enhancing the
model’s robustness and generalization capabilities. This was controlled by parameters such as:
– Waveform Mixup (aug_wave_mixup): Set to 1.0. This parameter indicates that
waveform mixup was applied to all audio samples during training. Waveform mixup involves
combining the waveforms of two different audio samples by taking a weighted average of
their amplitudes.
– Spectrogram Mixup (aug_spec_mixup): Set to 0.0. This parameter indicates that
spectrogram mixup was not applied to the spectrograms during training. Spectrogram mixup
would involve combining the spectrogram representations of two different audio samples in
a similar manner to waveform mixup.
– Probability of applying Spectrogram Mixup (aug_spec_mixup_prob): Set to 0.5. This
parameter defines the probability with which spectrogram mixup would be applied to an
audio sample during training. Even though spectrogram mixup was set to 0.0 in this case,
this parameter would control the likelihood of its application if it were used.</p>
        <p>The mix ratio for combining the samples was determined by a Beta distribution with  = 0.95. The
Beta distribution is commonly used in mixup techniques to control the interpolation between two
samples. A Beta distribution with  = 0.95 produces mix ratios that are generally close to 0 or 1,
meaning that the synthetic samples are predominantly composed of one of the original samples, with
only a small contribution from the other. This helps maintain the distinct features of each sample while
still providing the benefits of data augmentation.</p>
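        <p>As a sketch, the waveform mixup described above reduces to a single weighted average, with the mix ratio drawn from a symmetric Beta(0.95, 0.95) distribution:</p>
        <preformat>
import numpy as np

def waveform_mixup(x1, y1, x2, y2, alpha=0.95, rng=np.random):
    # ratios from Beta(0.95, 0.95) tend toward 0 or 1, so each synthetic
    # clip stays dominated by one of its two source clips
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2   # weighted average of amplitudes
    y = lam * y1 + (1.0 - lam) * y2   # mix the multi-hot labels the same way
    return x, y
        </preformat>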
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Training Procedure</title>
        <p>The training procedure involves cross-validation, early stopping, and augmentation strategies. The
training configurations are provided in Table 3.</p>
        <p>The mixup function combines two examples in the dataset to create a new example, enhancing
generalization. Spectral mixup further diversifies training data by combining spectrograms from
different audio samples. These strategies help the model learn robust and generalized representations,
improving classification performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. EfficientNet-B1 with Pre-Training</title>
      <sec id="sec-5-1">
        <title>5.1. Overview</title>
        <p>
          The task of audio classification involves categorizing audio signals into predefined classes. In this project,
I developed a model for identifying bird calls using TensorFlow and the EfficientNet-B1 architecture,
drawing inspiration from previous work on pre-training by Awsaf (Md Awsafur Rahman) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The
model was pre-trained on BirdCLEF datasets from 2021-2023 and Xeno-Canto Extend, and fine-tuned
on BirdCLEF 2024 data to enhance transfer learning. Advanced audio processing and feature extraction
techniques were employed to optimize performance on TPU devices, addressing challenges such as
spectrogram augmentation and effective transfer learning.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data Preparation</title>
        <p>
          Data Sources: BirdCLEF datasets from 2021-2024 and Xeno-Canto Extend [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
The raw audio data, stored in .ogg format, is efficiently handled using the ‘tf.data‘ API.
        </p>
        <p>Each audio clip is sampled at 32,000 Hz and has a uniform duration of 10 seconds. To efectively
capture the audio features, the spectrogram parameters are carefully chosen. The frequency range spans
from 20 Hz to 16,000 Hz. The Short-Time Fourier Transform (STFT) parameters include an FFT window
size of 2028 and a spectrogram window size of 2048. These configurations ensure that the critical audio
characteristics are well-represented for model training. More details are provided in Table 4.</p>
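        <p>For reference, these settings can be collected into a single configuration block; the dictionary form is illustrative only:</p>
        <preformat>
# Spectrogram settings for the EfficientNet-B1 pipeline, as stated above
SPEC_CFG = dict(
    sample_rate=32000,   # Hz
    duration=10,         # seconds per training clip
    fmin=20,             # lower frequency bound, Hz
    fmax=16000,          # upper frequency bound, Hz
    n_fft=2028,          # FFT window size as reported
    win_length=2048,     # spectrogram window size
)
        </preformat>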
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Feature Engineering</title>
        <p>Feature engineering focuses on transforming raw audio data into a format suitable for model training.
Mel spectrograms are generated from the audio signals, converting them into a visual representation of
frequency content over time. This transformation is achieved using the ‘MelSpectrogram‘ layer, where
each 10-second audio slice undergoes Fourier transformation to capture the spectral properties.</p>
        <p>
          To improve the model’s robustness, I apply various augmentation techniques on the spectrograms:
• Time and Frequency Masking: Randomly masks parts of the spectrogram in both time and
frequency dimensions.
• Normalization: Standardizes the data using mean and standard deviation, followed by rescaling
to the [0, 1] range.
        </p>
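        <p>A minimal sketch of these two steps, using torchaudio equivalents of the custom layers (the mask widths are assumptions, not reported values):</p>
        <preformat>
import torch
import torchaudio.transforms as T

time_mask = T.TimeMasking(time_mask_param=20)        # max masked time steps
freq_mask = T.FrequencyMasking(freq_mask_param=16)   # max masked Mel bins

def augment_and_normalize(mel):
    # mel: tensor of shape (n_mels, time)
    mel = freq_mask(time_mask(mel))                  # SpecAugment-style masking
    mel = (mel - mel.mean()) / (mel.std() + 1e-6)    # z-score standardization
    mel = (mel - mel.min()) / (mel.max() - mel.min() + 1e-6)  # rescale to [0, 1]
    return mel
        </preformat>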
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Model Building</title>
        <p>The model architecture is based on EfficientNet-B1, a convolutional neural network known for its
efficiency and performance. Key configurations include:
• Pretraining: Initialized with ImageNet weights to leverage transfer learning. The final activation
function used is Softmax for multi-class classification. Filter Stride Reduction (FSR) is used for
reducing the stride in the stem block.</p>
        <p>The model incorporates several custom layers, implemented with the TensorFlow library, to handle
specific tasks; the same design could also be implemented in PyTorch.</p>
        <p>• MelSpectrogram Layer: Converts audio signals into Mel spectrograms.
• TimeFreqMask Layer: Applies time and frequency masking for spectrogram augmentation.
• ZScoreMinMax Layer: Standardizes and rescales the spectrogram data.</p>
        <p>• MixUp and CutMix Layers: Augment the training data by mixing audio samples.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Augmentation</title>
        <p>
          Data augmentation plays a crucial role in improving the model’s generalization. Various augmentation
techniques are employed to introduce variability in the training data:
• Audio Augmentation:
– Gaussian Noise Addition: This technique was chosen to simulate different environmental
noise conditions and make the model robust to noise. Applied with a probability of 0.5, it
adds random noise to the audio signal, improving the model’s ability to generalize to noisy
data.
– Time Shifting: Shifting the audio signal in time to introduce variability, but set to 0 as it
disrupted the temporal patterns essential for accurate classification.
• Spectrogram Augmentation:
– MixUp: This technique involves mixing two audio signals to create a synthetic example,
applied with a probability of 0.65. This helps the model generalize better by providing varied
training examples.
– CutMix: Similar to MixUp, CutMix combines two audio signals but by cutting and pasting
parts of each, also applied with a probability of 0.65.
– Time and Frequency Masking: This technique was chosen to make the model more
robust to missing data by randomly masking parts of the spectrogram in both time and
frequency dimensions, applied with a probability of 0.5. It helps the model learn to handle
occlusions and missing parts in the data.
– Normalization: The ‘ZScoreMinMax‘ layer standardizes the spectrogram data using mean
and standard deviation, followed by rescaling to the [0, 1] range. This ensures that the data
fed into the model is on a consistent scale, improving learning stability.
        </p>
        <p>Effectiveness of Augmentation Techniques: The pre-trained EfficientNet-B1 model benefited
more from time and frequency masking techniques compared to the EfficientNet-B0 model. This is likely
because the EfficientNet-B1 model had already learned general audio features during its pre-training
phase, making it more adaptable to variations introduced by augmentations. The EfficientNet-B0 model,
lacking this pre-trained knowledge, struggled with the same augmentations as it was still learning
fundamental patterns from the current dataset.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Training Procedure</title>
        <p>The training procedure involves a carefully designed pipeline to ensure effective learning and
generalization. The dataset is stratified into five folds for cross-validation; classes with very few samples
are always included in the training set to address class imbalance. Upsampling is employed to ensure
that minority classes are adequately represented in the training data. The training configurations are
provided in Table 5.</p>
        <p>The model is trained on a TPU device, utilizing the TPU-VM for automatic device selection and
training acceleration. Early stopping is implemented to prevent overfitting, with a patience of 5 epochs.
The model’s performance is evaluated using the padded cmAP (macro-averaged average precision)
score, which accounts for class imbalance and zero true positive labels for certain species. Additionally,
the area under the Precision-Recall (PR) curve is used as the primary AUC evaluation metric.</p>
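        <p>A sketch of the padded cmAP computation follows; padding with five all-positive rows is the widely used BirdCLEF recipe, and the pad size is an assumption here:</p>
        <preformat>
import numpy as np
from sklearn.metrics import average_precision_score

def padded_cmap(y_true, y_pred, pad_rows=5):
    # append all-ones rows so species with zero true positives
    # still contribute to the macro average
    pad = np.ones((pad_rows, y_true.shape[1]))
    y_true = np.concatenate([y_true, pad])
    y_pred = np.concatenate([y_pred, pad])
    return average_precision_score(y_true, y_pred, average="macro")
        </preformat>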
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Post Processing</title>
      <p>Post-processing is integral to refining model predictions and improving overall classification
performance. The ensemble method combines the predictions of two distinct models to leverage their
individual strengths, thereby enhancing robustness and accuracy.</p>
      <sec id="sec-6-1">
        <title>6.1. Ensemble Strategy</title>
        <p>To achieve optimal classification results, I implemented an ensemble method where the final
prediction for each audio clip is computed as a weighted average of predictions from EfficientNet-B0 and
EfficientNet-B1. The ensemble weights were empirically determined based on cross-validation
performance: 0.6 for EfficientNet-B0 and 0.4 for EfficientNet-B1. This weighting scheme balances the unique
capabilities of each model effectively.</p>
        <p>Final Prediction = 0.6 × Prediction(EfficientNet-B0) + 0.4 × Prediction(EfficientNet-B1)</p>
        <p>This weighted average helps in smoothing out the variances and combining the high-confidence
predictions from each model. This approach effectively leverages the complementary strengths of
EfficientNet-B0 and EfficientNet-B1, providing a comprehensive solution for bird classification in the
BirdCLEF24 challenge.</p>
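        <p>In code, the ensemble rule is a single weighted sum over the per-clip class probabilities, as in this sketch:</p>
        <preformat>
import numpy as np

def ensemble(preds_b0, preds_b1, w_b0=0.6, w_b1=0.4):
    # weighted average of the two models' class probabilities
    return w_b0 * np.asarray(preds_b0) + w_b1 * np.asarray(preds_b1)
        </preformat>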
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Results</title>
      <p>
        The following results summarize the performance of the models and strategies evaluated in this
study, measured by their private and public scores on the BirdCLEF 2024 competition leaderboard. The
scores represent the macro-averaged ROC-AUC [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], accounting for class imbalance and providing a
robust measure of model performance.
      </p>
      <sec id="sec-7-1">
        <title>7.1. Analysis</title>
        <p>
          • EfficientNet-B0: The EfficientNet-B0 model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] demonstrated consistent performance, achieving
a private score of 0.649998 and a public score of 0.654307. This balanced performance across both
datasets indicates its robustness and generalization capabilities in handling the BirdCLEF dataset.
• EfficientNet-B1 with Pretraining: The EfficientNet-B1 model [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which was enhanced with
pretraining, showed a dip in the private score (0.596740) compared to its public score
(0.632268). This suggests that while pretraining improved its performance on the public dataset,
it may have led to some overfitting or less effective generalization on the private dataset.
• Models Ensemble: The ensemble of models [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] achieved the highest scores, with a private
score of 0.652743 and a public score of 0.663388. This approach effectively leveraged the strengths
of multiple models, leading to better overall performance and indicating the effectiveness of
model ensembling in complex tasks such as bird species identification from audio recordings.
        </p>
        <p>Overall, the results were very stable, with a correlation of 0.96 between public and private scores [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Nevertheless, notable rank changes were still visible between the public and private leaderboards.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>The ensemble of EfficientNet-B0 and EfficientNet-B1 models proved to be an effective strategy for
the BirdCLEF 2024 challenge, outperforming individual models and enhancing overall classification
performance. EfficientNet-B0, trained with heavy data augmentation, and EfficientNet-B1, pre-trained
on historical BirdCLEF datasets, complemented each other well. The ensemble approach leveraged
the strengths of both models, achieving a balance between computational efficiency and classification
accuracy. My findings underscore the importance of combining diverse data sources and robust
augmentation techniques in building resilient audio classification systems. Future work could explore
further optimization of ensemble weights and the incorporation of additional data augmentation
methods to continue improving performance in audio classification tasks.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Acknowledgments</title>
      <p>I would like to thank Stefan Kahl, Willem-Pier Vellinga, Tom Denton, Holger Klinck, and Hervé Glotin for
their exceptional leadership and expertise throughout the BirdCLEF24 competition. I am also immensely
grateful to the collaborating institutions—Kaggle, Chemnitz University of Technology, Google Research,
the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology, the Indian
Institute of Science Education and Research (IISER) Tirupati, LifeCLEF, and Xeno-canto—for providing
invaluable resources, data, and support. Their collective contributions have been crucial to the success
of this project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivathsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>CP</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Robin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF 2024:
          <article-title>Acoustic identification of under-studied bird species in the western ghats</article-title>
          ,
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Espitalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deneu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Estopinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leblanc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hrúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Servajean</surname>
          </string-name>
          , et al.,
          <article-title>Overview of LifeCLEF 2024: Challenges on species distribution prediction and identification</article-title>
          , in:
          <source>International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          , EfficientNet:
          <article-title>Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1905.11946</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/1905.11946.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aikhmelnytskyy</surname>
          </string-name>
          ,
          <article-title>Birdclef24 pretraining is all you need - infer</article-title>
          ,
          <year>2024</year>
          . URL: https://www.kaggle.com/code/aikhmelnytskyy/birdclef24-pretraining-is-all-you-need-infer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] TC0000, Birdclef starter notebook,
          <year>2024</year>
          . URL: https://www.kaggle.com/code/tc0000/birdclef-starter-notebook.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>O'Shea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nash</surname>
          </string-name>
          ,
          <article-title>An introduction to convolutional neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1511.08458</source>
          (
          <year>2015</year>
          ). URL: https://arxiv.org/abs/1511.08458.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Nicholson</surname>
          </string-name>
          , Birdclef 2024:
          <article-title>Spectrograms imagenet run</article-title>
          ,
          <year>2024</year>
          . URL: https://www.kaggle.com/code/richolson/birdclef-2024-spectrograms-imagenet-run#Initialize-submit-DF-with-correct-columns.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Short-time Fourier transform with the window size fixed in the frequency domain</article-title>
          (
          <year>2017</year>
          ). URL: https://www.researchgate.net/publication/321043608_Short-Time_Fourier_Transform_with_the_Window_Size_Fixed_in_the_Frequency_Domain.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Muljana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P. M.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>A review of online course dropout research: Implications for practice and future research</article-title>
          ,
          <source>ResearchGate</source>
          (
          <year>2019</year>
          ). URL: https://www.researchgate.net/publication/227246914_A_review_of_online_course_dropout_research_Implications_for_practice_and_future_research.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <article-title>How transferable are features in deep neural networks?</article-title>
          ,
          <source>arXiv preprint arXiv:1411.1792</source>
          (
          <year>2014</year>
          ). URL: https://arxiv.org/pdf/1411.1792.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Awsaf49,
          <article-title>Birdclef23 pretraining is all you need - train</article-title>
          ,
          <year>2023</year>
          . URL: https://www.kaggle.com/code/awsaf49/birdclef23-pretraining-is-all-you-need-train.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] Kaggle, Birdclef 2021 competition,
          <year>2021</year>
          . URL: https://www.kaggle.com/competitions/birdclef-2021.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Kaggle, Birdclef 2022 competition,
          <year>2022</year>
          . URL: https://www.kaggle.com/competitions/birdclef-2022.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] Kaggle, Birdclef 2023 competition,
          <year>2023</year>
          . URL: https://www.kaggle.com/competitions/birdclef-2023.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Kaggle, Birdclef 2024 competition,
          <year>2024</year>
          . URL: https://www.kaggle.com/competitions/birdclef-2024.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] R. Rao, Xeno-canto bird recordings extended A-M,
          <year>2024</year>
          . URL: https://www.kaggle.com/datasets/rohanrao/xeno-canto-bird-recordings-extended-a-m.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] R. Rao, Xeno-canto bird recordings extended N-Z,
          <year>2024</year>
          . URL: https://www.kaggle.com/datasets/rohanrao/xeno-canto-bird-recordings-extended-n-z.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Metric, Birdclef ROC AUC,
          <year>2024</year>
          . URL: https://www.kaggle.com/code/metric/birdclef-roc-auc.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Porwal, Silver medal solution - 25th place,
          <year>2024</year>
          . URL: https://www.kaggle.com/code/aadityaporwal/silver-medal-solution-25th-place.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Correlation between public and private scores,
          <year>2024</year>
          . URL: https://www.kaggle.com/competitions/birdclef-2024/discussion/512197.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>