<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improving Bird Recognition using Pseudo-Labeled Recordings from the Target Location</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Mario</forename><surname>Lasseck</surname></persName>
							<email>mario.lasseck@mfn.berlin</email>
							<affiliation key="aff0">
								<orgName type="institution">Museum für Naturkunde Berlin</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Improving Bird Recognition using Pseudo-Labeled Recordings from the Target Location</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6F03A66A27DF775EFE57530419C04C71</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Bird Species Recognition</term>
					<term>Biodiversity Assessment</term>
					<term>Soundscapes</term>
					<term>BirdCLEF</term>
					<term>Deep Learning</term>
					<term>Domain Adaptation</term>
					<term>Pseudo-Labeling</term>
					<term>Semi-Supervised Learning</term>
					<term>Kaggle Competition</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents a deep learning approach to identifying bird species in soundscape recordings with Convolutional Neural Networks (CNNs). The proposed method employs an iterative process to create pseudo labels for a large number of unlabeled recordings from the target location and applies them during training to significantly improve model performance and address the domain shift between training and test data. The effectiveness of the approach is evaluated in the BirdCLEF 2024 competition hosted on Kaggle, where it achieves a macro-averaged area under the ROC curve (AUC) of 69% on the official test set. This performance places the method among the top two systems for identifying birds in wildlife monitoring recordings from the Western Ghats, a major biodiversity hotspot in India.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The BirdCLEF 2024 competition focuses on developing automated systems for detecting and classifying under-studied bird species in the Western Ghats. This mountain range, a global biodiversity hotspot in India, hosts a variety of endemic and endangered species, including many found nowhere else in the world. As the region faces drastic landscape and climatic changes, there is an urgent need for advanced conservation tools to assess and monitor its unique birdlife. The challenge aims to identify native species of the Western Ghats sky-islands, classify rare birds with limited training data and detect elusive nocturnal species. This year's edition introduces several challenges and unique aspects:</p><p>• Participants must address a significant domain shift between the training data, which consists of focal recordings from various locations, and the test data, which comprises soundscapes from the Western Ghats.
• The competition imposes a strict time limit for species identification in the test set, adding a practical constraint that mirrors real-world applications for assessing and monitoring biodiversity.
• To aid in bridging the domain gap, an additional unlabeled dataset from the target location is provided, allowing participants to explore unsupervised and semi-supervised learning techniques.</p><p>By improving the accuracy and efficiency of bird identification algorithms under these constraints, this initiative supports ongoing conservation efforts, such as those led by V. V. Robin's Lab at IISER Tirupati <ref type="bibr">[1]</ref>. These innovations will empower researchers and practitioners to more effectively track avian population trends, evaluate threats and refine their conservation strategies in this ecologically crucial region.</p><p>Further details about the BirdCLEF 2024 competition are given in <ref type="bibr" target="#b0">[2]</ref>, [3] and <ref type="bibr" target="#b1">[4]</ref>. 
The task is part of the LifeCLEF 2024 evaluation campaign <ref type="bibr">[5,</ref><ref type="bibr" target="#b2">6]</ref> and the Conference and Labs of the Evaluation Forum <ref type="bibr" target="#b3">[7,</ref><ref type="bibr" target="#b4">8]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Materials and Methods</head><p>The implementation of the machine learning based system for bird species recognition presented in this paper builds upon solutions for previous BirdCLEF competitions and similar tasks <ref type="bibr" target="#b5">[9,</ref><ref type="bibr" target="#b6">10,</ref><ref type="bibr" target="#b7">11,</ref><ref type="bibr" target="#b9">12,</ref><ref type="bibr" target="#b10">13]</ref>. Further details on the author's past developments and implementation methods can be found, for example, in <ref type="bibr" target="#b11">[14]</ref>, <ref type="bibr" target="#b12">[15]</ref>, <ref type="bibr" target="#b13">[16]</ref> and <ref type="bibr" target="#b14">[17]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Datasets</head><p>The BirdCLEF 2024 training data consists of 24459 audio recordings provided by Xeno-canto [18], covering 182 different bird species. Unique to this year's task, an additional 8444 unlabeled recordings are provided from the same location as the test set soundscapes. Table <ref type="table" target="#tab_0">1</ref> provides an overview of the individual datasets and their characteristics. All recordings are resampled to 32 kHz, converted to mono, and compressed to Ogg format.</p><p>Xeno-canto files are weakly labeled, meaning there is no precise information on the presence or absence of the labeled bird within the recording. However, there is a high probability of hearing the labeled bird at the beginning of each audio file, as recordists often trim their recordings accordingly before uploading them. To exploit this characteristic, only the first 5 seconds of each recording are used for training. For some recordings, one or more background species are also provided as secondary labels.</p><p>For cross-validation, the training dataset is split into 5 or 8 stratified randomized folds, ensuring that primary species are proportionally represented in each fold.</p><p>This baseline system achieves a maximum AUC of 66% on the public test set. From this baseline, experiments were conducted with different CNN backbones, hyperparameter settings, augmentation methods and input image sizes. A major drawback of the initial model was its relatively long submission time of over one hour. In addition to improving the score, one objective was to reduce inference time in order to fit more models into an ensemble without exceeding the 2-hour submission time limit. To address this, the CNN backbone was replaced with an EfficientNet B0 architecture (tf_efficientnet_b0_ns <ref type="bibr">[34]</ref>) and the Mel spectrogram image was reduced to smaller dimensions. 
Results were initially unstable, with public leaderboard scores ranging from 62% to 66% AUC, and very sensitive to different combinations of Mel parameters and input image sizes. However, with further adjustments, it was possible to create single models with an inference time of around 12 minutes that still achieve a score of approximately 65% AUC.</p></div>
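The stratified randomized split described above can be sketched as follows. This is an illustrative reconstruction, not the original competition code; the function name and the round-robin assignment are assumptions:

```python
import numpy as np

def stratified_folds(primary_labels, n_folds=5, seed=42):
    """Assign each recording to a fold so that every primary species is
    spread as evenly as possible across the folds (a sketch of the
    stratified randomized split, not the original competition code)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(primary_labels)
    folds = np.empty(len(labels), dtype=int)
    for species in np.unique(labels):
        idx = np.flatnonzero(labels == species)
        rng.shuffle(idx)
        # round-robin assignment keeps per-class proportions balanced,
        # even for rare species with only a handful of recordings
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds
```

With this scheme, a species with only five recordings still contributes one recording to five different folds, so every fold sees every class.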
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Main changes to the initial model included:</head><p>• CNN backbone: tf_efficientnet_b0_ns </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Training Methods</head><p>The training data is divided into 5 or 8 folds, stratified according to primary labels. Only the first 5 seconds of each audio file are used for training. The models are trained using Convolutional Neural Network (CNN) backbones, specifically tf_efficientnet_b0_ns, which are pretrained on ImageNet. The training process employs the AdamW [37] optimizer and a one-cycle CosineAnnealingLR scheduler with a peak learning rate of 1e-3 and 3 warmup epochs. The average of binary cross-entropy and focal loss is used as the loss function.</p><p>For validation, the first 5 seconds of the files in the validation set are used to track learning progress through the evaluation metrics Label Ranking Average Precision (LRAP) <ref type="bibr">[38]</ref>, cMAP [39], F1 [40] and AUC <ref type="bibr">[41]</ref>. Background species are included with a target value of 1.0 and are treated equally to primary labeled species.</p><p>To enhance model stability and performance, "checkpoint soups" are used for single model inference. This follows the idea of model soups <ref type="bibr" target="#b16">[42]</ref>; here, however, weights from different checkpoints of the same model (typically from epochs 13-50) are averaged, provided there is an improvement in local cross-validation scores in at least one of the tracked metrics. This approach leads to more stable and occasionally better performance. For ensemble inference, predictions from several models are combined using simple mean averaging.</p><p>The modifications to the baseline model described above allowed the creation of an ensemble of six models, achieving 70% AUC. This ensemble was subsequently used to generate a first set of pseudo labels.</p></div>
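The greedy checkpoint-averaging scheme described above can be sketched as follows. This is a simplified sketch: the weight dictionaries stand in for PyTorch state dicts, and the `evaluate` callback (standing in for the tracked LRAP/cMAP/F1/AUC metrics) is an assumption:

```python
import numpy as np

def average_weights(weight_dicts):
    """Element-wise mean over a list of {parameter_name: array} weight dicts."""
    return {k: np.mean([w[k] for w in weight_dicts], axis=0)
            for k in weight_dicts[0]}

def checkpoint_soup(checkpoints, evaluate):
    """Greedy "checkpoint soup": add a checkpoint to the soup only if the
    averaged weights improve the validation score (a sketch; `evaluate`
    maps a weight dict to a cross-validation metric)."""
    soup = [checkpoints[0]]
    best = evaluate(average_weights(soup))
    for ckpt in checkpoints[1:]:
        candidate = average_weights(soup + [ckpt])
        score = evaluate(candidate)
        if score >= best:  # keep the checkpoint only on improvement
            soup.append(ckpt)
            best = score
    return average_weights(soup)
```

In the actual system, the same idea is applied to checkpoints of one model from roughly epochs 13-50, and the averaged weights are then used for single-model inference.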
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Performance Improvement with Pseudo Labels</head><p>Pseudo labels are created by applying the model ensemble to the unlabeled recordings from the test location. The predictions from all 5-second intervals of the 8444 unlabeled soundscapes form a large set of 401947 soft pseudo labels.</p><p>In the subsequent training stages, randomly selected audio segments from the pseudo-labeled recordings are mixed with the training samples at a probability of 25 to 45 percent. Before combining the audio signals, the amplitudes of both waveforms are multiplied by a random factor. The target vector of the training sample (with a value of 1.0 for primary and secondary species and 0 for others) is combined with the pseudo label vector (containing predicted probabilities) by taking the element-wise maximum of both to form the new target vector.</p><p>Incorporating pseudo labels into training significantly improved scores for both single models and ensembles. The enhanced ensemble was then used to generate a new set of pseudo labels, and this cycle was repeated multiple times to progressively improve model and ensemble performance. The iterative pseudo-labeling process is described in Figure <ref type="figure" target="#fig_0">1</ref>. Its impact on public and private leaderboard scores is presented in Table <ref type="table" target="#tab_2">2</ref> and visualized in Figure <ref type="figure" target="#fig_1">2</ref>. After the second iteration, pseudo label values became too large and required normalization by rescaling them back to the range [0,1] to allow stable model training. Unfortunately, the stage 3 ensemble was not selected for the final ranking because its public leaderboard score did not show the expected improvement. </p></div>
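The mixing step can be sketched as follows; this is an illustrative reconstruction, with parameter names mirroring pseudoLabelChance, ampExpMin and ampExpMax from Table 4:

```python
import numpy as np

def mix_with_pseudo(audio, target, pseudo_audio, pseudo_target,
                    chance=0.35, amp_exp_min=-0.5, amp_exp_max=0.1,
                    rng=np.random.default_rng(42)):
    """Mix a training sample with a pseudo-labeled segment (a sketch of the
    procedure described above). The new target is the element-wise maximum
    of the hard labels (0/1) and the soft pseudo labels (probabilities)."""
    if rng.random() >= chance:
        return audio, target  # sample left unmixed
    # scale both waveforms by random amplitude factors before mixing,
    # controlling the volume ratio between training and pseudo-labeled data
    amp_train = 10 ** rng.uniform(amp_exp_min, amp_exp_max)
    amp_pseudo = 10 ** rng.uniform(amp_exp_min, amp_exp_max)
    mixed = amp_train * audio + amp_pseudo * pseudo_audio
    return mixed, np.maximum(target, pseudo_target)
```

Taking the maximum rather than, say, the sum keeps hard labels at 1.0 while still letting confident pseudo predictions add species the training sample did not contain.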
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Post-Processing</head><p>Models are ensembled by simply taking the mean of the predictions (probabilities from sigmoid outputs) of each individual model. As a final step, for each test file, the predictions of a given time window are summed with those of the two neighboring windows using an aggregation factor of 0.5. This post-processing method was previously applied by Theo Viel and his team in the 3rd place solution [43] of the Cornell Birdcall Identification competition <ref type="bibr">[44]</ref>.</p></div>
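The neighbor-window aggregation can be sketched as follows; the exact handling of the first and last window (which have only one neighbor) is an assumption:

```python
import numpy as np

def smooth_windows(preds, factor=0.5):
    """Sum each time window's predictions with those of its two neighboring
    windows scaled by `factor` (a sketch of the post-processing step above;
    edge windows use their single neighbor). `preds` has shape
    (n_windows, n_classes)."""
    out = preds.copy()
    out[1:] += factor * preds[:-1]  # add previous window
    out[:-1] += factor * preds[1:]  # add next window
    return out
```

This exploits the fact that bird vocalizations often span window boundaries, so evidence from adjacent 5-second windows supports the current one.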
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inference Optimizations</head><p>To speed up inference, audio files from the test set are preprocessed in parallel using multithreading. Additionally, different versions of Mel spectrogram images are pre-calculated and reused for different models in the ensemble. By including models that work on smaller image sizes, ensembles of up to six models can run within the 2-hour limit to create predictions for all 1100 recordings in the test set.</p><p>Due to variations in the hardware provided by Kaggle for running inference notebooks, particularly in CPU types, the number of models that could be ensembled to identify all birds in the test set within the given time frame varied. To prevent submission errors, a timer is implemented in the notebook to ensure completion within the 2-hour limit. If the timer reaches approximately 118 minutes, inference is stopped and results are collected for all models and predicted file parts up to that point. Predictions from unfinished models or file parts are masked before averaging.</p></div>
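The timer-and-masking scheme can be sketched as follows. This is an illustrative sketch: `models` are callables mapping a file part to a per-class probability vector, and the exact stopping behavior of the original notebook may differ:

```python
import time
import numpy as np

def timed_ensemble_inference(models, file_parts, n_classes,
                             start_time, time_limit_s=118 * 60):
    """Run ensemble inference with a safety timer (a sketch of the scheme
    above; the limit of ~118 minutes leaves a margin below the 2-hour cap).
    Predictions from unfinished models or file parts remain NaN and are
    masked out of the average."""
    preds = np.full((len(models), len(file_parts), n_classes), np.nan)
    for m, model in enumerate(models):
        for p, part in enumerate(file_parts):
            if time.monotonic() - start_time > time_limit_s:
                # stop early; nanmean averages only finished predictions
                return np.nanmean(preds, axis=0)
            preds[m, p] = model(part)
    return np.nanmean(preds, axis=0)
```

Because `np.nanmean` ignores NaN entries, a partially finished model still contributes its completed file parts instead of invalidating the whole submission.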
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head><p>The training and pseudo-labeling approach described in this paper secured 2nd place among a total of 974 participating teams. Final scores on the public and private leaderboards, as well as the ranking of the top 10 teams, are presented in Table <ref type="table" target="#tab_3">3</ref>. By combining several diverse models, a macro-averaged ROC-AUC of 69.035% was achieved on the complete test set (see team 'adsr' in Table <ref type="table" target="#tab_3">3</ref>). Parameters and performance of the six models from the 2nd place solution (2nd stage ensemble in Table <ref type="table" target="#tab_2">2</ref>) are detailed in Table <ref type="table" target="#tab_4">4</ref>. Model diversity in the ensemble is achieved by varying Mel parameters, data subsets, image sizes, the probability of adding pseudo labels and amplitude factors to adjust the volume ratio between training and pseudo-labeled data. The parameters ampExpMin and ampExpMax in Table <ref type="table" target="#tab_4">4</ref> specify the range for the random amplitude factor applied to training and pseudo-label samples to adjust their volume in the mix: ampFactor = 10**(random.uniform(ampExpMin, ampExpMax)). Model 5 in Table <ref type="table" target="#tab_4">4</ref> is the only one utilizing external data. For this model, additional files for the 182 species in the competition were downloaded from Xeno-canto. The first 5 seconds of each file were added to the training set, with shorter files being padded with zeros to ensure a uniform length.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>As in previous editions of the BirdCLEF competition, the challenge was to use focal recordings from Xeno-canto to train a system capable of accurately identifying bird species in soundscapes. The inference time was again limited to 2 hours. However, compared to last year, over twice the amount of data had to be processed within that time (recordings with a total duration of 1 day, 9 hours and 20 minutes in 2023 vs. 3 days, 1 hour and 20 minutes in 2024). This placed even more constraints on the size and number of models that could be used to process all recordings in the test set. Other challenges included the extreme domain shift between training and test data, a significant class imbalance in the training samples (with some classes having only five example recordings per species) and the lack of diversity in the training material for many under-studied species in the target location.</p><p>Fortunately, a large set of unlabeled recordings from the same location as the test data was provided this year. With this dataset, it was possible to create pseudo labels and find an effective method of incorporating them into training to significantly improve identification performance. The approach described in this paper, using pseudo-labeled data from soundscapes of the deployment location, combines several advantages: noise augmentation, training data extension and knowledge distillation. For pseudo-labeling, only ensembles that fit the time limit constraint were used for inference. Using larger ensembles or including models with stronger backbones (e.g. with a higher number of layers for feature extraction) would likely lead to better pseudo labels. 
It would be interesting to investigate in future experiments how much further scores can be improved if stronger pseudo labels are incorporated during training.</p><p>With only two of the best models from the 2nd place system (models 1 and 2 in Table <ref type="table" target="#tab_4">4</ref>), it is possible to achieve a private leaderboard score of 69.694% AUC. The combination of these two models takes much less time for inference compared to using all six models. It surpasses the score of the entire ensemble and even that of the 1st place system of the competition (69.039% AUC). Another interesting finding is that, combined with pseudo-label training, the SED architecture with attention on frequency bands from last year <ref type="bibr" target="#b13">[16]</ref> achieves the best single model score (69.701% AUC on the private leaderboard). This again demonstrates that the feature engineering, network architecture, augmentation techniques and training methods of the BirdCLEF 2023 3rd place system [45] are quite robust and work well for the data and species sets of this year's task.</p><p>A customized version of the model to identify European bird species is available on GitHub <ref type="bibr">[46]</ref>. It was successfully implemented in a number of tools and projects to assess and monitor avian biodiversity <ref type="bibr" target="#b17">[47,</ref><ref type="bibr" target="#b18">48,</ref><ref type="bibr" target="#b19">49,</ref><ref type="bibr">50,</ref><ref type="bibr">51,</ref><ref type="bibr">52]</ref> and is also part of Naturblick [53], a smartphone application to discover and learn about nature in urban surroundings.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Iterative pseudo-labeling process to improve single model and ensemble performance</figDesc><graphic coords="4,135.70,598.90,317.19,104.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Visualization of performance improvement using pseudo labels from different training stages</figDesc><graphic coords="5,85.75,262.32,411.75,186.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Advantages of pseudo-label training</head><figDesc>1. Noise augmentation: By mixing training samples with samples from the target domain, the model learns how species sound within the environmental background noise of the test site habitat. This helps to address the domain shift between Xeno-canto recordings and test soundscapes. 2. Training data extension: The model receives more training samples representing the noise characteristics and species distribution of the deployment location. 3. Knowledge distillation: Since pseudo labels are derived from predictions of a stronger model (or an ensemble of models in this case), its knowledge is transferred during training to the smaller model.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Datasets overview and statistics</figDesc><table><row><cell></cell><cell>Training set</cell><cell>Unlabeled set</cell><cell>Test set</cell></row><row><cell>Recording type</cell><cell>Focal</cell><cell>Soundscape</cell><cell>Soundscape</cell></row><row><cell>Source</cell><cell>Various locations (Xeno-canto)</cell><cell>Western Ghats</cell><cell>Western Ghats</cell></row><row><cell># Recordings</cell><cell>24459</cell><cell>8444</cell><cell>1100</cell></row><row><cell>Min. duration per rec.</cell><cell>0.47s</cell><cell>20s</cell><cell>4m</cell></row><row><cell>Max. duration per rec.</cell><cell>1h 39m 24s</cell><cell>4m</cell><cell>4m</cell></row><row><cell>Acc. duration all rec.</cell><cell>11d 20h 50m 30s</cell><cell cols="2">23d 6h 19m 11s 3d 1h 20m</cell></row><row><cell># Species / Classes</cell><cell>182</cell><cell>unknown</cell><cell>unknown</cell></row><row><cell>Min. # rec. per class</cell><cell>5</cell><cell>unknown</cell><cell>unknown</cell></row><row><cell>Max. # rec. per class</cell><cell>500</cell><cell>unknown</cell><cell>unknown</cell></row></table><note>o CosineAnnealingLR scheduler [28] with 5 warmup epochs [29] o Peak learning rate 1e-4 o 100 epochs with early stopping if AUC is not improving for 7 epochs o Batch size 64 o Average of binary cross-entropy [30] and focal loss [31] as loss function o Generalized-Mean (GeM) pooling • Augmentations: o HorizontalFlip [32] o CoarseDropout [33] o Mixup of Mel spectrogram images within training batches</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Performance improvement using pseudo labels from different training stages</figDesc><table><row><cell cols="2">Stage Pseudo labels</cell><cell>Single model (ID 4)</cell><cell>Ensemble</cell></row><row><cell></cell><cell></cell><cell>publ. | priv. LB AUC [%]</cell><cell>publ. | priv. LB AUC [%]</cell></row><row><cell>0</cell><cell>-</cell><cell>65.735 | 59.270</cell><cell>70.065 | 61.738</cell></row><row><cell>1</cell><cell>From stage 0 ensemble</cell><cell>69.165 | 66.119</cell><cell>71.090 | 67.084</cell></row><row><cell>2</cell><cell>From stage 1 ensemble</cell><cell>69.936 | 67.445</cell><cell>72.528 | 69.035</cell></row><row><cell>3</cell><cell>From stage 2 ens. (normalized)</cell><cell>71.154 | 67.683</cell><cell>71.716 | 69.527</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Competition results of the top 10 teams (with solution of team 'adsr' described in this paper)</figDesc><table><row><cell cols="2">Rank Team Name on Kaggle</cell><cell>AUC [%]</cell><cell>AUC [%]</cell></row><row><cell></cell><cell></cell><cell>(publ. LB)</cell><cell>(priv. LB)</cell></row><row><cell>1</cell><cell>Team Kefir</cell><cell>73.857</cell><cell>69.039</cell></row><row><cell>2</cell><cell>adsr</cell><cell>72.794</cell><cell>69.035</cell></row><row><cell>3</cell><cell>NVBird</cell><cell>74.212</cell><cell>68.997</cell></row><row><cell>4</cell><cell>Team Cerberus</cell><cell>74.691</cell><cell>68.777</cell></row><row><cell>5</cell><cell>coolz</cell><cell>74.396</cell><cell>68.717</cell></row><row><cell>6</cell><cell>penguin46</cell><cell>72.039</cell><cell>68.716</cell></row><row><cell>7</cell><cell>Team Unicorn</cell><cell>72.809</cell><cell>68.383</cell></row><row><cell>8</cell><cell>kapenon</cell><cell>69.660</cell><cell>67.928</cell></row><row><cell>9</cell><cell>Aphysict</cell><cell>71.453</cell><cell>67.891</cell></row><row><cell>10</cell><cell>Tamo</cell><cell>70.132</cell><cell>67.623</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 :</head><label>4</label><figDesc>Single model parameters and performances of the 2nd place ensemble</figDesc><table><row><cell>Params. / Model ID</cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>6</cell></row><row><cell>seed</cell><cell>42</cell><cell>42</cell><cell>42</cell><cell>42</cell><cell>70</cell><cell>42</cell></row><row><cell>n_folds</cell><cell>5</cell><cell>5</cell><cell>5</cell><cell>5</cell><cell>10</cell><cell>5</cell></row><row><cell>fold</cell><cell>4</cell><cell>1</cell><cell>4</cell><cell>4</cell><cell>0</cell><cell>4</cell></row><row><cell>dataset</cell><cell>bc24</cell><cell>bc24</cell><cell>bc24</cell><cell>bc24</cell><cell>bc24+</cell><cell>bc24</cell></row><row><cell>n_mels</cell><cell>128</cell><cell>128</cell><cell>128</cell><cell>64</cell><cell>64</cell><cell>64</cell></row><row><cell>hop_length</cell><cell>512</cell><cell>512</cell><cell>1024</cell><cell>1024</cell><cell>1024</cell><cell>1024</cell></row><row><cell>image_height</cell><cell>256</cell><cell>256</cell><cell>128</cell><cell>64</cell><cell>64</cell><cell>64</cell></row><row><cell>image_width</cell><cell>256</cell><cell>256</cell><cell>128</cell><cell>128</cell><cell>128</cell><cell>64</cell></row><row><cell>pseudoLabelChance [%]</cell><cell>35</cell><cell>40</cell><cell>45</cell><cell>30</cell><cell>30</cell><cell>25</cell></row><row><cell>ampExpMin</cell><cell>-0.5</cell><cell>-1.0</cell><cell>-0.5</cell><cell>-0.5</cell><cell>-0.5</cell><cell>-0.5</cell></row><row><cell>ampExpMax</cell><cell>0.1</cell><cell>0.2</cell><cell>0.1</cell><cell>0.1</cell><cell>0.1</cell><cell>0.1</cell></row><row><cell>Inference time</cell><cell>~ 50 min.</cell><cell>~ 50 min.</cell><cell>~ 17 min.</cell><cell>~ 12 min.</cell><cell>~ 12 min.</cell><cell>~ 11 min.</cell></row><row><cell>Public LB AUC [%]</cell><cell>73.270</cell><cell>71.975</cell><cell>71.104</cell><cell>69.936</cell><cell>69.124</cell><cell>69.309</cell></row><row><cell>Private LB AUC [%]</cell><cell>68.521</cell><cell>68.533</cell><cell>68.116</cell><cell>67.445</cell><cell>64.543</cell><cell>65.862</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Acknowledgements</head><p>I would like to thank Stefan Kahl, Holger Klinck, Maggie, Sohier Dane, Tom Denton, Vijay Ramesh, Maximilian Eibl, Chiti Arvind, Harikrishnan C.P., Viral Joshi, V.V. Robin, Suyash Sawant, Alexis Joly, Henning Müller, Divya Mudappa, T.R. Shankar Raman, Meghana Srivathsa, Akshay V. Anand, Willem-Pier Vellinga and all involved institutions and individual contributors (Kaggle, Chemnitz University of Technology, Columbia University, Google Research, Indian Institute of Science Education and Research Tirupati, K. Lisa Yang Center for Conservation Bioacoustics, LifeCLEF, Nature Conservation Foundation, Parry Agro Industries Ltd., Project Dhvani, Tamil Nadu Forest Department, Tata Coffee Ltd., Tea Estates India Ltd., The Rufford Foundation, The University of Florida and Xeno-canto) for organizing this competition.</p><p>I also want to thank the Museum für Naturkunde and the team of the Animal Sound Archive Berlin [54] in particular Karl-Heinz Frommolt, Olaf Jahn and Benjamin Werner for supporting my work. The research was partly funded by the BMEL (Bundesministerium für Ernährung und Landwirtschaft) within the project "Machbarkeitsstudie -Integration (bio-)akustischer Methoden zur Quantifizierung biologischer Vielfalt in das Waldmonitoring" (FKZ: 2221NR050B).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><surname>Maggie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ramesh</surname></persName>
		</author>
		<ptr target="https://kaggle.com/competitions/birdclef-2024" />
		<title level="m">BirdCLEF</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Kaggle</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of BirdCLEF 2024: Acoustic identification of under-studied bird species in the Western Ghats</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Klinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Srivathsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Arvind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">P</forename><surname>Harikrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sawant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V</forename><surname>Robin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of lifeclef 2024: Challenges on Species Distribution Prediction and Identification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Picek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Espitalier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Botella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Deneu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Marcos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Estopinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Leblanc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Larcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šulc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hrúz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Servajean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
		<editor>Galuščáková P, García Seco de Herrera A</editor>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Experimental IR Meets Multilinguality, Multimodality, and Interaction</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Quénot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Di Nunzio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García Seco de Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifteenth International Conference of the CLEF Association</title>
				<meeting>the Fifteenth International Conference of the CLEF Association (CLEF 2024)<address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Audio based bird species identification using deep learning techniques</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sprengel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jaggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kilcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hofmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Large-Scale Bird Sound Classification using Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wilhelm-Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hussein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Two Convolutional Neural Networks for Bird Detection in Audio Signals</title>
		<author>
			<persName><forename type="first">T</forename><surname>Grill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schlüter</surname></persName>
		</author>
		<idno type="DOI">10.23919/EUSIPCO.2017.8081512</idno>
		<ptr target="https://doi.org/10.23919/EUSIPCO.2017.8081512" />
	</analytic>
	<monogr>
		<title level="m">25th European Signal Processing Conference (EUSIPCO 2017)</title>
				<meeting><address><addrLine>Kos, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Audio bird classification with Inception-v4 extended with time and time-frequency attention mechanisms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sevilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge</title>
		<author>
			<persName><forename type="first">D</forename><surname>Stowell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Stylianou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pamuła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Methods in Ecology and Evolution</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Audio-based Bird Species Identification with Deep Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Acoustic Bird Detection with Deep Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Plumbley</surname></persName>
		</editor>
		<meeting>the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="143" to="147" />
		</imprint>
		<respStmt>
			<orgName>Tampere University of Technology</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Bird Species Identification in Soundscapes</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Bird Species Recognition using Convolutional Neural Networks with Attention on Frequency Bands</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">ImageNet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.05482</idno>
		<title level="m">Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Evaluation of acoustic pattern recognition of nightingale (Luscinia megarhynchos) recordings by citizens</title>
		<author>
			<persName><forename type="first">M</forename><surname>Stehle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khorramshahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Sturm</surname></persName>
		</author>
		<idno type="DOI">10.3897/rio.6.e50233</idno>
	</analytic>
	<monogr>
		<title level="j">Research Ideas and Outcomes</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">e50233</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Towards a multisensor station for automated biodiversity monitoring</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Wägele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bodesheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Bourlat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Denzler</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.baae.2022.01.003</idno>
	</analytic>
	<monogr>
		<title level="j">Basic and Applied Ecology</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="105" to="138" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Weather stations for biodiversity: a comprehensive approach to an automated and modular monitoring system</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Wägele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">F</forename><surname>Tschan</surname></persName>
		</author>
		<idno type="DOI">10.3897/ab.e119534</idno>
		<ptr target="https://doi.org/10.3897/ab.e119534" />
	</analytic>
	<monogr>
		<title level="m">Advanced Books</title>
				<meeting><address><addrLine>Sofia</addrLine></address></meeting>
		<imprint>
			<publisher>Pensoft</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1" to="218" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
