<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dual-branch Network for Species Identification via Passive Acoustic Monitoring</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jingyin Tan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aiguo Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science and Artificial Intelligence, Foshan University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The BirdCLEF2025 task aims to train a multi-label classifier to infer the presence probabilities of multiple species from audio signals. In this work, we introduce a dual-branch architecture to build an end-to-end passive acoustic monitoring predictor. Specifically, two different types of acoustic features (i.e., Mel-spectrogram and Mel-Frequency Cepstral Coefficients) are first extracted from the raw signals. ResNet and ConvNeXt are then used to learn the two branches of features. Afterwards, the features are concatenated and fed into a fully connected layer to output the prediction probabilities. Finally, we conduct comparative experiments on the competition test data. Experimental results show that the proposed model achieves a macro-averaged ROC-AUC of 0.751 and 0.771 on the 34% and 66% test sets, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>species identification</kwd>
        <kwd>dual-branch</kwd>
        <kwd>acoustic features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The identification of under-studied species via passive acoustic monitoring is less affected by weather and makes species more detectable, thereby enhancing biodiversity monitoring[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], compared with conventional observer-based biodiversity surveys[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The BirdCLEF2025[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] competition aims to predict the presence of each of 206 target species in every 5-second segment of audio recordings, which is an important application scenario of passive acoustic monitoring. Accordingly, researchers have explored various methods in BirdCLEF2024 towards higher accuracy. For example, the approach in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] utilizes a transfer learning method based on pseudo multi-labels, demonstrating the effectiveness of leveraging pretrained embeddings for birdcall classification. The method in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] designs an ensemble model that combines EfficientNet-B0 and EfficientNet-B1 to leverage the strengths of different models.
      </p>
      <p>Although previous methods have achieved promising results, they overlook the use of multi-channel features and the impact of pretrained weights on the feature mapping backbones. To this end, we propose in this work a dual-branch neural network that uses two types of acoustic features to better learn the latent, meaningful features. The main contributions of this work are outlined as follows.</p>
      <p>(1) A dual-branch neural network is proposed to build an end-to-end passive acoustic monitoring predictor for species identification. Two types of features, Mel-spectrogram (Mel) and Mel-Frequency Cepstral Coefficients (MFCC), are extracted from the raw audio signals and then fed into two typical pretrained feature representation networks (i.e., ResNet and ConvNeXt).</p>
      <p>(2) We conduct comparative experiments to evaluate the effectiveness of the proposed model. In particular, we evaluate different ways of initializing the parameters of ResNet and ConvNeXt. Results show that the strategy of combining a ResNet backbone with pretrained weights and a ConvNeXt backbone with random weights outperforms the others, with scores of 0.751 and 0.771 on the two test splits, respectively (Team name: Hathaway Tan, Rank: 1455th).</p>
      <p>The structure of this paper is as follows. Section 2 details the proposed model. Section 3 introduces the dataset and preprocessing steps and presents the experimental results, followed by the conclusion section.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Feature extraction</title>
        <p>
          Considering the combined effect of different types of acoustic features in analyzing signals[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we explore in this work two types of features (Mel-spectrogram and Mel-Frequency Cepstral Coefficients) to take advantage of multi-channel feature representation.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Pretrained weight analysis</title>
        <p>
          Recent studies indicate that pretrained models have been widely used for feature mapping because of their effectiveness in accelerating training[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], yet they tend to suffer from limited accuracy due to the small size of the fine-tuning dataset in the downstream task[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Hence, we also conduct comparative experiments to evaluate the impact of pretrained weights on the feature mapping backbones.
        </p>
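        <p>For illustration, the two initialization strategies compared in the experiments (Section 3.3) can be written as follows; this is a minimal sketch under the assumption that the timm library is used to instantiate the backbones, not the exact training code.</p>
        <preformat>
import timm

# "fine-tuned": backbone initialized with ImageNet-1k pretrained weights
resnet_ft = timm.create_model("resnet50.a1_in1k", pretrained=True, num_classes=0)

# "from scratch": same architecture, but randomly initialized parameters
convnext_fs = timm.create_model("convnextv2_pico.fcmae", pretrained=False, num_classes=0)
        </preformat>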
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setup and results</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>The BirdCLEF2025 training audio dataset consists of 28,564 recordings in total, covering 206 unique species across four major taxonomic classes. Each recording has a duration ranging from 0.54 s to 1774 s. Table 1 presents a summary of the dataset and Figure 2 displays the top 20 species by number of training recordings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental setup</title>
        <p>To increase the number of training samples, we use a non-overlapping sliding window to segment the raw audio data into 10-second slices, where zero padding is applied to extend the data if the original recording is shorter than 10 s. In total, we obtain 78,579 segments. Figure 3 shows an example of a data segment from an audio file in the training dataset, and a minimal segmentation sketch is given below.</p>
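        <p>The following sketch illustrates the segmentation step; how the trailing remainder of recordings longer than a multiple of 10 s is handled is not specified above, so padding it as well is an assumption here.</p>
        <preformat>
import numpy as np

def segment_audio(y, sr, win_sec=10):
    """Split a waveform into non-overlapping 10-second slices,
    zero-padding short recordings (and, by assumption, the trailing
    remainder of long ones) to a full window."""
    win = int(win_sec * sr)
    pad = (-len(y)) % win          # 0 if the length is already a multiple of win
    y = np.pad(y, (0, pad))
    return y.reshape(-1, win)      # shape: (num_segments, win)
        </preformat>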
        <p>We then extract the MFCC and Mel features from each segment. Specifically, the MFCC features consist of the 13-dimensional MFCCs together with their first-order and second-order derivatives. For the Mel-spectrogram features, we extract 128 Mel frequency bands per frame, resulting in a 128-dimensional feature vector. We set the FFT window size to 1024 and the hop length to 512; these parameters define the time-frequency resolution when computing the MFCC and Mel features. We normalize the MFCCs and Mels sample by sample using the Z-score approach. The MFCCs and Mels are finally reshaped to 224 × 224 pixels.</p>
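        <p>A minimal feature extraction sketch with librosa is shown below; the decibel conversion of the Mel-spectrogram and the exact resizing step are assumptions, as they are not fully specified above.</p>
        <preformat>
import librosa
import numpy as np

def extract_features(y, sr, n_fft=1024, hop_length=512):
    """MFCC branch: 13 MFCCs plus first- and second-order deltas (39 rows).
    Mel branch: 128-band Mel-spectrogram (here converted to dB, an assumption).
    Both maps are z-scored per sample; resizing to 224 x 224 is done afterwards."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop_length)
    mfcc = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                         n_fft=n_fft, hop_length=hop_length)
    mel = librosa.power_to_db(mel, ref=np.max)
    zscore = lambda x: (x - x.mean()) / (x.std() + 1e-8)
    return zscore(mfcc), zscore(mel)
        </preformat>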
        <p>
          For the training procedure, we utilize ResNet50.a1_in1k pretrained on the ImageNet-1k dataset[
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ] and ConvNeXtv2_pico.fcmae[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The loss function is BCEWithLogitsLoss. The model is trained for at most 30 epochs with an early-stopping strategy to avoid overfitting. We fine-tune the end-to-end model with the AdamW optimizer. An initial learning rate of 0.001 is used together with a cosine annealing learning rate scheduler, which decays the learning rate along a cosine curve from the initial value down to 1e-6.
        </p>
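        <p>The sketch below shows how the dual-branch model and the training components described above can be assembled with timm and PyTorch. The assignment of the Mel branch to ResNet and the MFCC branch to ConvNeXt, the single-channel inputs, and the pretrained/from-scratch choice shown here (the best configuration in Section 3.3) are illustrative assumptions rather than the exact implementation.</p>
        <preformat>
import timm
import torch
import torch.nn as nn

class DualBranchNet(nn.Module):
    """Two backbones produce pooled features that are concatenated and
    mapped to 206 species logits by a fully connected layer."""
    def __init__(self, num_classes=206):
        super().__init__()
        # NOTE: assigning Mel to ResNet and MFCC to ConvNeXt is an assumption for illustration.
        # ResNet branch with ImageNet-1k weights (the "fine-tuned" setting)
        self.mel_branch = timm.create_model("resnet50.a1_in1k", pretrained=True,
                                            in_chans=1, num_classes=0)
        # ConvNeXt branch with random initialization (the "from scratch" setting)
        self.mfcc_branch = timm.create_model("convnextv2_pico.fcmae", pretrained=False,
                                             in_chans=1, num_classes=0)
        feat_dim = self.mel_branch.num_features + self.mfcc_branch.num_features
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, mel, mfcc):
        feats = torch.cat([self.mel_branch(mel), self.mfcc_branch(mfcc)], dim=1)
        return self.fc(feats)  # raw logits, paired with BCEWithLogitsLoss

model = DualBranchNet()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30, eta_min=1e-6)
        </preformat>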
        <p>As for the performance metric, we use a version of the macro-averaged ROC-AUC that skips classes with no true positive labels. We employ 3-fold stratified cross-validation for training, and in each fold the model achieving the highest average AUC on the validation set is saved.</p>
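        <p>A re-implementation sketch of this metric, using scikit-learn's roc_auc_score, is given below; it follows the description above and is not the official competition scoring code.</p>
        <preformat>
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_roc_auc(y_true, y_score):
    """Macro-averaged ROC-AUC over the classes that have at least one
    positive label in y_true; classes without positives are skipped."""
    aucs = []
    for c in range(y_true.shape[1]):
        if y_true[:, c].sum() > 0:
            aucs.append(roc_auc_score(y_true[:, c], y_score[:, c]))
    return float(np.mean(aucs))
        </preformat>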
        <p>During the test stage, prediction is conducted on the hidden test set: the model’s outputs are transformed into per-class probabilities, and AUC scores are calculated from the predictions and the ground-truth multi-label annotations. The submission format requires predictions for each 5-second test audio segment, so we concatenate each 5-second segment with itself to create a 10-second slice. We opt not to train directly on 5-second segments due to the limited acoustic context they offer, which can negatively impact classification performance.</p>
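        <p>The inference step for a single test segment can be sketched as follows; featurize is a hypothetical placeholder for the preprocessing of Section 3.2, and batching details are omitted.</p>
        <preformat>
import numpy as np
import torch

def predict_segment(model, seg_5s, featurize):
    """Duplicate a 5-second test segment into a 10-second slice, extract the
    two feature maps, and apply a sigmoid to obtain per-species probabilities."""
    seg_10s = np.concatenate([seg_5s, seg_5s])
    mel, mfcc = featurize(seg_10s)          # assumed to return batched tensors
    with torch.no_grad():
        logits = model(mel, mfcc)
    return torch.sigmoid(logits)
        </preformat>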
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental results</title>
        <p>Tables 2 and 3 present the results of different weight-initialization strategies on the 34% and 66% test data, respectively. In the tables, “fine-tuned” means the feature mapping networks are equipped with pretrained weights, while “from scratch” indicates that their parameters are randomly initialized.</p>
        <p>The experimental results presented in Tables 2 and 3 show the AUC scores of various model combinations on the two test splits (34% and 66%). Across both splits, the model ResNet_ft (fine-tuned) + ConvNeXt_fs (from scratch) consistently achieves the highest AUC scores, namely 0.751 on the 34% test set and 0.771 on the 66% test set, indicating superior performance. In contrast, the model ResNet_fs (from scratch) + ConvNeXt_ft (fine-tuned) performs the worst; interestingly, it even performs worse than the fully from-scratch combination ResNet_fs + ConvNeXt_fs, which may imply that the pretrained ConvNeXt features have a less significant or even slightly detrimental effect when the ResNet branch is not pretrained.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Discussion</title>
        <p>Our study reveals the following performance ordering: ResNet pretrained only &gt; both pretrained &gt; no pretraining &gt; ConvNeXt pretrained only. These findings show that visually pretrained models transfer well to the low-level texture and edge features of spectrograms but require close alignment between the input representation and the pretraining domain. Based on our findings, we recommend using pretrained ImageNet weights only when the input representation retains visual-like structures, such as Mel-spectrograms, which benefit from learned low-level convolutional filters. In contrast, for more abstract representations such as MFCC, we advise against using pretrained visual weights.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this work, we proposed a dual-branch architecture that leverages both Mel-spectrogram and MFCC features, processed through ResNet and ConvNeXt backbones, for passive acoustic monitoring in the BirdCLEF2025 task. Comparative experiments on the competition’s test datasets demonstrate the effectiveness of our design, with the model achieving macro-averaged ROC-AUC scores of 0.751 on the 34% test set and 0.771 on the 66% test set. These results confirm that the proposed end-to-end framework can effectively capture complementary acoustic information and deliver robust multi-label bird species classification performance.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI GPT-4o for grammar and spelling checking. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Marques</surname>
          </string-name>
          ,
          <article-title>Passive acoustic monitoring for estimating animal density</article-title>
          ,
          <source>Acoustics Today</source>
          <volume>8</volume>
          (
          <year>2012</year>
          )
          <fpage>35</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Demkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          , T. Denton, Birdclef+
          <year>2025</year>
          , https://kaggle.com/competitions/birdclef-2025,
          <year>2025</year>
          . Kaggle.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adam</surname>
          </string-name>
          , et al.,
          <source>Overview of LifeCLEF</source>
          <year>2025</year>
          :
          <article-title>Challenges on species presence prediction and identification, and individual animal identification</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Toro-Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rodriguez-Buritica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Benavides-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Ulloa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Caycedo-Rosales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Klinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Goëau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planqué</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , Overview of BirdCLEF+
          <year>2025</year>
          <article-title>: Multi-taxonomic sound identification in the middle magdalena valley, colombia</article-title>
          ,
          <source>in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miyaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gustineli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Transfer learning with pseudo multi-label birdcall classification for DS@GT BirdCLEF 2024</article-title>
          , arXiv preprint arXiv:2407.06291 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Porwal</surname>
          </string-name>
          ,
          <article-title>Bird-species audio identification, ensembling of EfficientNet-B0 and pre-trained EfficientNet-B1 model</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Yaseen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-Y.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <article-title>Classification of heart sound signal using multiple features</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>8</volume>
          (
          <year>2018</year>
          )
          <fpage>2344</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Rethinking imagenet pre-training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4918</fpage>
          -
          <lpage>4927</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Liu, M. Long, J. Wang, M. I. Jordan, Towards understanding the transferability of deep representations, arXiv preprint arXiv:1909.12031 (2019).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] R. Wightman, H. Touvron, H. Jégou, ResNet strikes back: An improved training procedure in timm, arXiv preprint arXiv:2110.00476 (2021).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, S. Xie, ConvNeXt V2: Co-designing and scaling convnets with masked autoencoders, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 16133-16142.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>