<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Recognizing Bird Species in Audio Files Using Transfer Learning FHDO Biomedical Computer Science Group (BCSG)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Fritzler</surname></persName>
							<email>andreas.fritzler@stud.fh-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">University of Applied Sciences and Arts Dortmund (FHDO) Department of Computer Science</orgName>
								<address>
									<addrLine>Emil-Figge-Strasse 42</addrLine>
									<postCode>44227</postCode>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sven</forename><surname>Koitka</surname></persName>
							<email>sven.koitka@fh-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">University of Applied Sciences and Arts Dortmund (FHDO) Department of Computer Science</orgName>
								<address>
									<addrLine>Emil-Figge-Strasse 42</addrLine>
									<postCode>44227</postCode>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<addrLine>Otto-Hahn-Str. 14</addrLine>
									<postCode>44227</postCode>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christoph</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
							<email>christoph.friedrich@fh-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department">University of Applied Sciences and Arts Dortmund (FHDO) Department of Computer Science</orgName>
								<address>
									<addrLine>Emil-Figge-Strasse 42</addrLine>
									<postCode>44227</postCode>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Recognizing Bird Species in Audio Files Using Transfer Learning FHDO Biomedical Computer Science Group (BCSG)</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E953479B322133661F4DA04D7287EB64</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Bird Species Identification</term>
					<term>BirdCLEF</term>
					<term>Audio</term>
					<term>Short-Term Fourier Transform</term>
					<term>Convolutional Neural Network</term>
					<term>Transfer Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, a method to identify bird species in audio recordings is presented. For this purpose, a pre-trained Inception-v3 convolutional neural network was used. The network was fine-tuned on 36,492 audio recordings representing 1,500 bird species in the context of the BirdCLEF 2017 task. Audio records were transformed into spectrograms and further processed by applying bandpass filtering, noise filtering, and silent region removal. For data augmentation purposes, time shifting, time stretching, pitch shifting, and pitch stretching were applied. This paper shows that fine-tuning a pre-trained convolutional neural network performs better than training a neural network from scratch. Domain adaptation from image to audio domain could be successfully applied. The networks' results were evaluated in the BirdCLEF 2017 task and achieved an official mean average precision (MAP) score of 0.567 for traditional records and a MAP score of 0.496 for records with background species on the test dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Since 2014, a competition called BirdCLEF is hosted every year by the LifeCLEF lab <ref type="bibr" target="#b4">[5]</ref>. The LifeCLEF lab is part of the "Conference and Labs of the Evaluation Forum" (CLEF). The goal of the competition is to identify bird species in audio recordings. The difficulty of the competition increases every year. This year, in the BirdCLEF 2017 task <ref type="bibr" target="#b1">[2]</ref>, 1,500 bird species had to be identified. The training dataset was built from the Xeno-canto collaborative database <ref type="foot" target="#foot_0">3</ref> and consists of 36,492 audio recordings. These records are highly diverse according to sample rate, length, and the quality of their content. The test dataset comprises 13,272 audio recordings.</p><p>In 2016, a deep learning approach was applied by <ref type="bibr" target="#b16">[17]</ref> to the bird identification task and outperformed other competitors. In this research, a similar method, inspired by the last year's winner is used with an additional extension. Transfer learning <ref type="bibr" target="#b10">[11]</ref> is applied by using a pre-trained Inception-v3 <ref type="bibr" target="#b18">[19]</ref> convolutional neural network. Related works of identifying bird species in audio recordings in the BirdCLEF 2016 task <ref type="bibr" target="#b2">[3]</ref> can be found in <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b19">20]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methodology</head><p>To solve the BirdCLEF 2017 task, a convolutional neural network on audio spectrograms was used. The main methodology was oriented on the winner <ref type="bibr" target="#b16">[17]</ref> of the BirdCLEF 2016 task. The concept of their preprocessing method was partially used. The following sections describe the workflow and parameters in an abstract way, details on the parameters for the runs are given in Section 3. data augmentation was applied that includes time shifting, time stretching using factors in the range [0.85, 1.15), pitch shifting, and pitch stretching using percentages in the set {0, . . . , 8}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Overview</head><p>The whole training was done in three phases. In the first phase, the top layers of the pre-trained model were deleted <ref type="foot" target="#foot_3">6</ref> and trained from scratch leaving the rest of the model fixed. The reason for this is to adjust the number of output classes from the pre-trained network with 1,000 classes to 1,500 species. Afterward, the second phase was started, and the whole model was fine-tuned including all trainable weights. Throughout the whole training during the second phase snapshots of the model were validated every few epochs with pictures that were transformed from the validation set. This way the models' progress according to the MAP score was monitored. It was done to recognize overfitting. After the second phase, a snapshot with the best-monitored MAP score was selected for a third training phase. In this phase, image files from the full training set were used to fine-tune the model further. When the third step was finished, the model was ready to classify test files.</p><p>Finally, the BirdCLEF 2017 test dataset was preprocessed in a similar but not an identical manner as the full training dataset. Details are described later in this Section. During preprocessing, every audio file was transformed into many picture files. In the prediction phase, a fixed region was cropped from the center of every picture file and was predicted by the fully trained model. The predictions were combined by averaging all image segments per audio file for final results. In addition, time-coded soundscapes were grouped in ranges of 5 seconds. The predictions were ordered in descending order per audio file. Furthermore, predictions in time-coded soundscapes were ordered per 5-second regions. In the end, a result file was generated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Preprocessing for Training</head><p>The progress of the following described preprocessing steps can be seen in  Extracting Frequency Domain Representation A frequency domain representation was generated for all of the audio files using Short-Term Fourier Transform (STFT) <ref type="bibr" target="#b0">[1]</ref>. For this purpose, a Java library "Open Intelligent Multimedia Analysis for Java" (OpenIMAJ) <ref type="foot" target="#foot_4">7</ref> [4] version 1.3.5 was used. It is available under the New BSD License, and it is able to process .wav and also .mp3 audio files. Unfortunately, OpenIMAJ does not support sample overlapping in an easy way by itself, so it had to be implemented. Furthermore, it seems OpenIMAJ is not capable of processing audio files with a bit depth of 24 bits. Two time-coded soundscape audio files 8 in the test dataset were converted from a bit depth of 24 bits to 16 bits with the python library "librosa" version 0.5.0 <ref type="bibr" target="#b8">[9]</ref>, that is available 9 under the ISC License.</p><p>Audio files in BirdCLEF 2017 datasets have different sample rates thus the window size (amount of samples) that was used for the STFT depended on the file's sample rate. For a sample rate of 44.1 kHz, a length of 512 samples was used to create a slice of 256 frequency bands (later on the vertical axis of an image). One slice represents a time interval of approximately 11.6 ms. For a file with a different sample rate, the size of the window was adjusted to match the time interval of 11.6 ms. Audio files were padded with zeros if their last window had fewer samples than were needed for the transform.</p><p>The extracted frequency domain representation is a matrix. Its elements were normalized to the range [0, 1]. Every element of this matrix represents a pixel in the exported image. The logarithm of the elements was not taken, but instead, the values were processed in a linear manner. The matrix was further processed using different methods to remove unnecessary information to reduce its size.</p><p>Bandpass filtering A frequency histogram of the full training set is shown in Figure <ref type="figure">3</ref>. Most of the frequencies below 500 Hz are dominated by noises, for example, wind or mechanical vibration. This circumstance explains the peak in the lower frequency range. It was determined by manually examining 20 files that were randomly selected from the full training set.</p><p>One previous work <ref type="bibr" target="#b9">[10]</ref> removed frequencies under 1 kHz. Audio recordings were in 16 kHz PCM format. The authors in <ref type="bibr" target="#b19">[20]</ref> participated in the BirdCLEF 2016 task and used a low-pass filter with a cutoff frequency of 6,250 Hz.</p><p>In this research, a lower frequency limit of 1,000 Hz and an upper frequency limit of 12,025 Hz was used for bandpass filtering. This reduced the 256 frequency bands by half to 128 bands.  Noise Filtering Median Clipping was applied to reduce noise like wind blowing. This method was also used by the winner <ref type="bibr" target="#b16">[17]</ref> of BirdCLEF 2016 task and formerly by <ref type="bibr" target="#b6">[7]</ref>. It selects all of the elements in the matrix whose values are three times bigger than their corresponding row (frequency band) median and three times larger than their corresponding column (time frame) median. The other elements are set to zero. Afterward, tiny objects were removed. 
If all of the 8 neighbor elements of an element were zeros, then the element itself was also set to zero.</p></div>
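The preprocessing described above could be approximated as in the following sketch, which substitutes librosa and NumPy for the authors' OpenIMAJ-based Java pipeline; the function name, the normalization constant, and the exact band selection are assumptions:

```python
import numpy as np
import librosa

def spectrogram(path, factor=3.0):
    y, sr = librosa.load(path, sr=None)              # keep the native sample rate
    n_fft = int(round(512 * sr / 44100))             # window scaled so one slice ~ 11.6 ms
    spec = np.abs(librosa.stft(y, n_fft=n_fft))
    spec /= spec.max() + 1e-12                       # linear values in [0, 1], no logarithm
    # Bandpass filtering: keep roughly 1,000 Hz to 12,025 Hz.
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    spec = spec[(freqs >= 1000) & (freqs <= 12025), :]
    # Median clipping: keep cells exceeding 3x their row and column medians.
    row_med = np.median(spec, axis=1, keepdims=True)
    col_med = np.median(spec, axis=0, keepdims=True)
    mask = (spec > factor * row_med) & (spec > factor * col_med)
    return np.where(mask, spec, 0.0)
```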
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Silent Region Removal</head><p>The authors in <ref type="bibr" target="#b16">[17]</ref> used signal to noise separation to extract bird calls from audio files. In this research, regions with less information were deleted to retain bird calls in the following way. If the average of a fixed region did not reach a threshold, then the region was removed. Every column was examined on its own. In every column, the number of non-zero elements was counted and normalized by the total number of elements in each column. For this procedure, a threshold of 0.01 was used. After this step, the resulting matrix could have just a few or even zero columns.</p><p>In the end, if the resulting matrix had less than 32 columns, the audio file was completely discarded from training.</p><p>Exporting Image Files Images were exported using a fixed resolution. If after the previous processing steps a matrix had fewer columns than the defined target width of a picture then the matrix was padded to the desired amount of columns and its available content was looped into the padded area.</p><p>The completely processed frequency representation was segmented into equalsized pieces of a fixed length and a predefined overlapping factor. The matrices' elements were in the range [0, 1] and were scaled by a constant factor as well as clamped to the maximum value of 255. The elements were used for all of the three channels in the final picture. As a result, the three channels contained the same information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Preprocessing for Prediction</head><p>During the preprocessing of the BirdCLEF 2017 test dataset, one exception was made to time-coded soundscapes. On these files, silent region removal was not applied to preserve their full length. Furthermore, no audio files were discarded if they had less than 32 columns in their matrix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Data Augmentation</head><p>Due to the input dimension of Inception-v3 (299x299x3) the generated picture files were processed at this stage before they were forwarded to train the model. This was done by cropping a region from the original image. First, a target cropping location was computed with a jitter for the vertical axis (random y offset). Next, time shifting was applied by moving the starting x position randomly along the x-axis. Then, time stretching was used by moving the target width by a random factor in the range [0.85, 1.15). After that, pitch shifting was combined with pitch stretching and was calculated by moving the starting y position randomly. The target height was reduced randomly the same way. The maximum amount of pitch stretch was 8% in total. The calculated region was cropped from the original picture and was scaled with bilinear interpolation to a size of 299x299 pixels on all of the 3 channels (red, green, blue) to match the input dimension of Inception-v3. Figure <ref type="figure" target="#fig_4">4</ref> shows this procedure visually. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Run Details and Results</head><p>Although more recent network architectures exist like Inception-v4 <ref type="bibr" target="#b17">[18]</ref> and Inception-ResNet-v2 <ref type="bibr" target="#b17">[18]</ref> which might improve the results in comparison to Inception-v3, the former ones were not used for this research because they are slower than the Inception-v3. The former ones are also available as pre-trained models <ref type="foot" target="#foot_5">10</ref> and are potential candidates for future work. Four runs were submitted in total. Three runs used slightly different methods of preprocessing, and the fourth run combined the results of the former three runs by averaging them.</p><p>First, binary run (Run 2) was created with the preprocessing pipeline (compare Section 2.2) and binary images. Next, grayscale run (Run 4) was created with a few changes to binary run (Run 2) to examine the differences in MAP scores in comparison to binary run. Lastly, big run (Run 1) was designed by improving some parts of the previous runs and correcting some mistakes. The runs were submitted in alphabetical order according to their description names thus the run's details in this Section does not follow the run's number but rather their temporal creation time.</p><p>Training was done on one NVIDIA Tesla K80 graphics card that contains 2 GPUs with 12 GB of RAM each. A mini-batch size of 32 was used per GPU, which results in an effective batch size of 64. Fine-tuning of a single model until the stage of prediction took several days. The machine was used non-exclusively.</p><p>Predicting was done on one NVIDIA Titan X Pascal GPU.</p><p>Table <ref type="table" target="#tab_1">1</ref> shows the runs' achieved results measured in MAP score on the reduced training set and the validation set using all predictions. To show the advantages of transfer learning, all of the runs were executed twice with identical parameters. On the one hand a pre-trained Inception-v3 was used, and on the other hand, the Inception-v3 was trained from scratch. Results in Table <ref type="table" target="#tab_1">1</ref> show that fine-tuning a pre-trained convolutional neural network performs better than training a neural network from scratch, although pre-training was done on another domain. In addition, official results on the BirdCLEF 2017 test dataset of the submitted runs are stated as well. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Binary Run: Run 2</head><p>The following Section describes only additions and differences compared to the description in Section 2.</p><p>Preprocessing STFT used 512 samples without sample overlapping. After the step noise filtering, all of the elements in the matrix greater than 0 were set to 1 to create a monochrome picture file. After silent region removal, 45 audio files were discarded from training.</p><p>Images were exported using a resolution of 256 pixels in width and 128 pixels in height. One image file represents a length of 2.97 s. For this purpose, the previously generated matrices were segmented into equal-sized fragments of 256 pixels in width with an overlapping factor of 7  8 . Before matrices were exported to pictures, their elements were multiplied by 255. The resulting values were used for all of the three channels in a picture. The reduced training set led to 1,365,849 picture files (2.5 GiB). From the validation set, 145,724 image files were generated (282.6 MiB). The test dataset produced 1,583,771 picture files (2.66 GiB).</p><p>Training and Data Augmentation Learning rates were fixed in this run. The top layers of Inception-v3 were trained for 1.48 epochs with a learning rate of 0.01. Training on the reduced training set was done for 15.8 epochs with a learning rate of 0.0002. A MAP score of 0.487 was achieved on the validation set. After that, the full training set was used for training for another 4.28 epochs with a learning rate of 0.0002.</p><p>During data augmentation, a region of 128 pixels in width (±15%) and 128 pixels in height (−8%) should have been randomly cropped.</p><p>Predicting In the predicting phase, a region of 128x128 pixels was cropped from the center of every picture file. The cropped length of 128 pixels corresponds to a time interval of 1.49 s.</p><p>Mistakes In this run, data augmentation was implemented incorrectly. No randomness was used. When training was started then the parameters for time shifting, time stretching, and pitch shifting were generated in a random manner, but these values were always the same as long as training was not restarted.</p><p>The model reached a phase of overfitting. Because the best checkpoint according to MAP score was not saved, an overfitted version of the model was used to complete the BirdCLEF task. The best-monitored MAP score of the lost checkpoint was 0.511 after 8 epochs of training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Grayscale Run: Run 4</head><p>This run was almost the same as binary run (Run 2). Here only differences to binary run (Run 2) are described.</p><p>Preprocessing In the preprocessing step, there were only two differences compared to binary run (Run 2). First, the frequency domain representation in the range [0, 1] was used without being transformed into zeros and ones. Second, before image files were exported, the elements of the matrices were multiplied by 2,000 and cut off at value 255. This led to picture files that contained grayscale information. Everything else in the preprocessing pipeline was left unchanged. The number of files compared to binary run (Run 2) had not changed, but the file size had increased. The reduced training set had a size of 7.4 GiB, the validation set consisted of 812 MiB, and the test set counted 7.25 GiB.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Training and Data Augmentation</head><p>The top layers of Inception-v3 were trained for 1.74 epochs with a fixed learning rate of 0.02. Afterward, all layers were trained using an exponential learning rate. The learning rate descended smoothly. A staircase function was not used. As training had started, the learning rate had a value of 0.005. After 5.4 epochs, the learning rate reached a value of 0.0003, and a MAP score of 0.541 was achieved on the validation set. Unfortunately, training was restarted every few epochs to slightly adjust the learning rate. Afterward, training was started on the full training set for another 2.6 epochs with an exponential learning rate, starting at 0.0002 and ending at 0.0001.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mistakes</head><p>The same mistakes as they were made in the binary run (Run 2) were also made in this run. Data augmentation was not working properly. This led to an overfitted model after 6 epochs of training. Training was restarted every few epochs to correct the learning rate. As a side effect, the model was trained on more different pictures than the model in the binary run (Run 2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Big Run: Run 1</head><p>The name big run is derived from the size of pictures that were generated in the preprocessing step. Pictures were created by processing each channel (red, green, blue) differently. After 7 epochs of fine-tuning, this model had a MAP score of 0.531. Due to the deadline of the BirdCLEF 2017 task, this model could not be trained completely as planned. One can assume that if this model was trained for more epochs, the MAP score should become a little bit better because data augmentation mistakes from the previously made models were corrected.</p><p>Preprocessing STFT used a window size of 942 samples. A slice of 471 frequency bands was generated this way. This slice represents a time interval of approximately 21.4 ms. Furthermore, sample overlapping of 75% was used.</p><p>Bandpass filtering used a lower frequency limit of 900 Hz and an upper frequency limit of 15,100 Hz. This reduced the 471 frequency bands to 303 bands.</p><p>Before the method described in silent region removal was applied, two other processing steps were executed. First, all of the elements in the first 50 columns (approximately 0.27 s) were examined. That means the arithmetic mean of that region was calculated. If the calculated value did not reach a threshold of 0.0001, then the whole region was discarded. Otherwise, the region to be examined was shifted with 75% overlapping. This was repeated throughout the whole matrix. Very silent regions of an audio signal were deleted this way. Second, every column was examined on its own. If the arithmetic mean of a column did not reach a threshold of 0.0001, then the column was removed using a special treatment. Up to three sequenced columns may have each an average value below the threshold. These columns were not deleted. Up to three following columns were set to zero if each of their averages was also below the threshold. All subsequent columns each with an average below the threshold were removed. This procedure separated parts with much audio information visually even more from each other while quiet frames were deleted. After these two steps, the process described in silent region removal was applied. In the end, 7 audio files were discarded from training.</p><p>Images were exported using a resolution of 450 pixels in width and 303 pixels in height. The width of 450 pixels represents a length of approximately 2.4 s.</p><p>The completely processed frequency representation was segmented into equalsized pieces with a length of 450 columns and an overlapping factor of 2  3 . The matrices' were multiplied by 1,000 and then cut off at 255. The result was copied to three matrices. Each matrix represents a color channel of the final picture. One matrix (red channel) was blurred using Gaussian blur <ref type="bibr" target="#b15">[16]</ref> with a radius of 4. Another matrix (blue channel) was sharpened using CLAHE algorithm <ref type="bibr" target="#b12">[13]</ref>. A block radius of 10 and 32 bins were used. The third matrix (green channel) was left untouched. An example of the three differently processed channels is shown in Figure <ref type="figure">5</ref>.</p><p>The reduced training set was transformed into 816,421 image files (23.3 GiB), the validation set has produced 87,448 image files (2.5 GiB), and the test set was converted to 932,573 images (24.4 GiB).</p><p>original (green channel) blurred (red channel) sharpened (blue channel) combined (red, green, blue) Fig. 
<ref type="figure">5</ref>: Visualization of the generated channels as well as the final composed image. For better visualization the spectrogram was not preprocessed.</p><p>Data Augmentation A target cropping location was computed with a jitter of 4 pixels (∆ y ∈ {0, . . . , 4}). At this point, the target region had a shape of 299x299 pixels. Time stretching manipulated the target width. Pitch shifting and pitch stretching were applied by moving the starting y position randomly by 0, 3, 6, 9, or 12 pixels (that corresponds to percentages in the set {0, . . . , 4}). Target height was manipulated the same way.</p><p>Training During the first phase of training, a learning rate of 0.02 was used for 1 epoch, and a rate of 0.01 was used for a second epoch. After that, the second phase was started with a learning rate of 0.0008. In the second phase, the learning rate was exponentially decreased by a staircase function. That means the rate was adjusted after every epoch was fully completed. A learning rate decay value of 0.7 for every completed epoch was used. After 7 epochs, the model reached a learning rate of 0.000066. A MAP score of 0.531 was achieved on the validation set. The third phase was started using a fixed learning rate of 0.0002 for another 1.98 epochs.</p><p>Predicting In the prediction phase, a region of 299x299 pixels was cropped from the center of every picture file and was predicted by the fully trained model. 299 pixels represent a length of 1.6 s.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Combined Run: Run 3</head><p>Two different methods of combining predictions <ref type="bibr" target="#b5">[6]</ref> were tried in every run when predictions of picture files were combined to create a prediction of an audio file. Calculating the arithmetic mean was one method. The other method was majority voting. This can be explained in the following way: a prediction of a picture is an expert. One asks all of the experts of an audio file to vote for a single target class. The class with the maximum number of votes is the predicted class. Calculating the arithmetic mean always performed better. Its MAP score had a relative difference of 1%-10% compared to the MAP score of majority voting. Run 3 had not a separate model that was used to predict test audio files but rather the predictions of the test dataset of the other three runs were combined. This was done by averaging the predictions of every single picture file that belongs to one audio file. The combination of results of every model after the second training phase led to a MAP score of 0.598.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and Future Work</head><p>An approach to identify bird species in audio recordings was shown. For this purpose, a preprocessing pipeline was created and a pre-trained Inception-v3 convolutional neural network was fine-tuned. It could be shown that fine-tuning a pre-trained convolutional neural network leads to better results than training a neural network from scratch. It is remarkable, that this type of transfer learning is even working from the image to the audio domain.</p><p>Unfortunately, the error-free model was not trained long enough to show its full potential. The models presented in this paper reached fair results in the context of the competition and leave room for improvement. A possible enhancement concerns the preprocessing pipeline and data augmentation. Future works should consider transferring the preprocessed frequency domain representation to a convolutional neural network avoiding the use of picture files.</p><p>Furthermore, this research has not focused on identifying bird species in soundscapes. The winner team of the BirdCLEF 2016 task has extracted noisy parts from audio files and mixed them into other audio files. Additionally, a sound effects library with many different ambient noises recorded in nature could be used. This could increase the diversity of the training files during the phase of data augmentation further. This approach was not implemented in this research due to time limitations.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Visualization of the model creation pipeline.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Visualization of the preprocessing pipeline. The STFT spectrograms were logarithmized for better visualization.</figDesc><graphic coords="4,152.06,409.21,82.39,55.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 :</head><label>4</label><figDesc>Fig. 4: Visualization of the real-time data augmentation pipeline during training.</figDesc><graphic coords="7,369.66,338.17,109.77,70.06" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Frequency histogram of the full BirdCLEF 2017 training dataset.</figDesc><table><row><cell>Frequency</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Relative</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>6</cell><cell>7</cell><cell>8</cell><cell>9</cell><cell>10</cell><cell>11 12</cell><cell>13</cell><cell>14 15</cell><cell>16 17</cell><cell>18</cell><cell>19 20</cell><cell>21</cell><cell>22</cell><cell>kHz</cell></row><row><cell cols="15">Fig. 3: 8 LIFECLEF2017 BIRD HD SOUNDSCAPE WAV RN49908.wav</cell><cell></cell><cell></cell><cell></cell><cell>and</cell></row><row><cell cols="15">LIFECLEF2017 BIRD HD SOUNDSCAPE WAV RN49909.wav</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="14">9 https://github.com/librosa/librosa (last access: 01.06.2017)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>Achieved results measured in MAP</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="4">BirdCLEF 2017</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="5">BirdCLEF 2017</cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell cols="4">training dataset</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="5">test dataset</cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="5">official results</cell><cell></cell></row><row><cell></cell><cell cols="4">Inception-v3</cell><cell cols="4">pre-trained</cell><cell></cell><cell></cell><cell cols="5">pre-trained</cell><cell></cell></row><row><cell></cell><cell cols="8">from scratch Inception-v3</cell><cell></cell><cell></cell><cell cols="5">Inception-v3</cell><cell></cell></row><row><cell></cell><cell>Reduced training set</cell><cell>(90% subset)</cell><cell>Validation set</cell><cell>(10% subset)</cell><cell>Reduced training set</cell><cell>(90% subset)</cell><cell>Validation set</cell><cell>(10% subset)</cell><cell>Soundscapes</cell><cell>with time-codes</cell><cell>Soundscapes</cell><cell>without time-codes</cell><cell>(same queries 2016)</cell><cell>Traditional Records</cell><cell>(only main species)</cell><cell>Traditional Records</cell><cell>(with background species)</cell></row><row><cell>Binary Run (Run 2)</cell><cell cols="17">0.627 0.415 0.815 0.487 0.069 0.048 0.491 0.431</cell></row><row><cell cols="18">Grayscale Run (Run 4) 0.490 0.303 0.928 0.541 0.083 0.023 0.504 0.438</cell></row><row><cell>Big Run (Run 1)</cell><cell cols="17">0.415 0.333 0.832 0.531 0.056 0.041 0.492 0.427</cell></row><row><cell cols="18">Combined Run (Run 3) 0.672 0.455 0.932 0.598 0.097 0.039 0.567 0.496</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://www.xeno-canto.org/ (last access: 31.05.2017)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">http://download.tensorflow.org/models/inception v3 2016 08 28.tar.gz (last access: 27.03.2017)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">https://github.com/tensorflow/models/tree/master/slim (last access: 23.05.2017)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">scopes InceptionV3/Logits and InceptionV3/AuxLogits</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">http://openimaj.org/ (last access: 20.05.2017)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_5">http://download.tensorflow.org/models/inception v4 2016 09 09.tar.gz (last access: 28.05.2017) and http://download.tensorflow.org/models/inception resnet v2 2016 08 30.tar.gz (last access: 28.05.2017)</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU which supported this research.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Short term spectral analysis, synthesis, and modification by discrete fourier transform</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Allen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Acoustics, Speech, Signal Processing</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="235" to="238" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">LifeCLEF bird identification task 2017</title>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2017 -Conference and Labs of the Evaluation forum</title>
				<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-09-14">11-14 September, 2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">LifeCLEF bird identification task 2016: The arrival of deep learning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR-WS Proceedings Notes</title>
		<meeting><address><addrLine>Évora, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-09-08">5-8 September, 2016. 2016</date>
			<biblScope unit="volume">1609</biblScope>
			<biblScope unit="page" from="440" to="449" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">OpenIMAJ and ImageTerrier: Java libraries and tools for scalable multimedia analysis and indexing of images</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Hare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Samangooei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Dupplaw</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM international conference on Multimedia</title>
				<meeting>the 19th ACM international conference on Multimedia<address><addrLine>MM</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">2011. 2011</date>
			<biblScope unit="page" from="691" to="694" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">LifeCLEF 2017 lab overview: multimedia species identification challenges</title>
		<author>
			<persName><forename type="first">Alexis</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hervé</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hervé</forename><surname>Hervé And Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Concetto</forename><surname>Spampinato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pierre</forename><surname>Bonnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Willem-Pier And</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jean-Christophe And</forename><surname>Lombardo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><surname>Palazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Henning</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLEF</title>
				<meeting>CLEF</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page">2017</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Combining Pattern Classifiers: Methods and Algorithms</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">I</forename><surname>Kuncheva</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>Wiley</publisher>
		</imprint>
	</monogr>
	<note>2nd Edition</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Bird song classification in field recordings: Winning solution for NIPS4B 2013 competition</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of int. symp. Neural Information Scaled for Bioacoustics, sabiod.org/nips4b, joint to NIPS</title>
				<meeting>of int. symp. Neural Information Scaled for Bioacoustics, sabiod.org/nips4b, joint to NIPS</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="176" to="181" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Improving bird identification using multiresolution template matching and feature selection during training</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lasseck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR-WS Proceedings Notes</title>
		<meeting><address><addrLine>Évora, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-09-08">5-8 September, 2016. 2016</date>
			<biblScope unit="volume">1609</biblScope>
			<biblScope unit="page" from="490" to="501" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<author>
			<persName><forename type="first">B</forename><surname>Mcfee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mcvicar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Nieto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Balke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Thome</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Battenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bittner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Yamamoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">R</forename><surname>Stoter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Repetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Waloschek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Carr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kranzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Viktorin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holovaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Pimenta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.293021</idno>
		<ptr target="https://doi.org/10.5281/zenodo.293021" />
	</analytic>
	<monogr>
		<title level="j">librosa</title>
		<imprint>
			<biblScope unit="volume">0</biblScope>
			<biblScope unit="issue">0</biblScope>
			<date type="published" when="2017-02">feb 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Time-frequency segmentation of bird song in noisy acoustic environments</title>
		<author>
			<persName><forename type="first">L</forename><surname>Neal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Briggs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Raich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">Z</forename><surname>Fern</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP</title>
				<meeting>the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP</meeting>
		<imprint>
			<date type="published" when="2011">2011. 2011</date>
			<biblScope unit="page" from="2012" to="2015" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Learning and transferring mid-level image representations using convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Oquab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sivic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2014">2014. 2014</date>
			<biblScope unit="page" from="1717" to="1724" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Recognizing bird species in audio recordings using deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Piczak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR-WS Proceedings Notes</title>
		<meeting><address><addrLine>Évora, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-09-08">5-8 September, 2016. 2016</date>
			<biblScope unit="volume">1609</biblScope>
			<biblScope unit="page" from="534" to="543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Adaptive histogram equalization and its variations</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Pizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Amburn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Austin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cromartie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Geselowitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Greer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">T</forename><surname>Haar Romeny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Zimmerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zuiderveld</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Vision, Graphics and Image Processing</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="page" from="355" to="368" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Bag of MFCC-based words for bird identification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ricard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR-WS Proceedings Notes</title>
		<meeting><address><addrLine>Évora, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-09-08">5-8 September, 2016. 2016</date>
			<biblScope unit="volume">1609</biblScope>
			<biblScope unit="page" from="544" to="546" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">ImageNet large scale visual recognition challenge</title>
		<author>
			<persName><forename type="first">O</forename><surname>Russakovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satheesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karpathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">115</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="211" to="252" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Computer Vision</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">G</forename><surname>Shapiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">C</forename><surname>Stockman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
			<publisher>Prentice Hall</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Audio based bird species identification using deep learning techniques</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sprengel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jaggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kilcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hofmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR-WS Proceedings Notes</title>
		<meeting><address><addrLine>Évora, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-09-08">5-8 September, 2016. 2016</date>
			<biblScope unit="volume">1609</biblScope>
			<biblScope unit="page" from="547" to="559" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Inception-v4, Inception-ResNet and the impact of residual connections on learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Learning Representations Workshop</title>
				<meeting>the International Conference on Learning Representations Workshop<address><addrLine>ICLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Rethinking the inception architecture for computer vision</title>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wojna</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1512.00567v3" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="2818" to="2826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Convolutional neural networks for large-scale bird song classification in noisy environment</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">P</forename><surname>Tóth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Czeba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR-WS Proceedings Notes</title>
		<meeting><address><addrLine>Évora, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-09-08">5-8 September, 2016. 2016</date>
			<biblScope unit="volume">1609</biblScope>
			<biblScope unit="page" from="560" to="568" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
