<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Audio Bird Classification with Inception-v4 extended with Time and Time-Frequency Attention Mechanisms</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Antoine</forename><surname>Sevilla</surname></persName>
							<email>antoine-sevilla@etud.univ-tln.fr</email>
							<affiliation key="aff0">
								<orgName type="department">AMU</orgName>
								<orgName type="laboratory">LSIS UMR 7296</orgName>
								<orgName type="institution" key="instit1">Univ. Toulon</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">ENSAM</orgName>
								<address>
									<settlement>DYNI team</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hervé</forename><surname>Glotin</surname></persName>
							<email>herve.glotin@univ-tln.fr</email>
							<affiliation key="aff0">
								<orgName type="department">AMU</orgName>
								<orgName type="laboratory">LSIS UMR 7296</orgName>
								<orgName type="institution" key="instit1">Univ. Toulon</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">ENSAM</orgName>
								<address>
									<settlement>DYNI team</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Audio Bird Classification with Inception-v4 extended with Time and Time-Frequency Attention Mechanisms</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C37D5ED479345F1F72847897DCF78B3A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Deep Learning</term>
					<term>Inception-v4</term>
					<term>Bird Species Classification</term>
					<term>Transfer Learning</term>
					<term>Attention Mechanism</term>
					<term>Sound Detection</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present an adaptation of the deep convolutional network Inception-v4 tailored to solving bioacoustic classification problems. Bird sound classification was treated as if it were an image classification problem by a transfer learning of Inception. Inception, the state-of-the-art in image classification, was used together with an attention algorithm, to (multiscale) time-frequency representations or images of bird sounds. This has resulted in an efficient pipeline, that we call Soundception. Soundception scored highest on all tasks in the BirdClef2017 challenge. It reached 0.714 Mean Average Precision in the task that asked for classification of 1500 bird species. To our knowledge Soundception is currently the most effective model for biodiversity monitoring of complex soundscapes.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The main objective of our approach is to create an easy-to-use pipeline of an acoustic model from an image model. We want to stay in the same framework of the state-of-the art deep learning <ref type="bibr" target="#b6">[7]</ref> Inception-v4, and to transfer it to soundscape classification. Inception-v4 has been pre-trained on imagenet, then we adapt its inputs and learn a bird acoustic activity detector thanks to the classification outputs joint to an attention mechanism. Then we process a transfer learning from pre-trained weights on imagenet to bird classification. The paper describes our methodology to efficiently build this model, that we call 'Soundception', in few weeks a reduced GPU resources. We show that Soundception gives to the best scores of the BirdClef 2017 challenge <ref type="bibr" target="#b4">[5]</ref>. In the last section we discuss on perspectives to increase the accuracy of Soundception.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Audio featuring</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Audio to tri-channel time-frequency image</head><p>The data representation is a crucial step in any learning process. In our approach, the representation must be scalable and based on image processing including some acoustic specificities, because we take advantage of Google Inception that is fed by a RGB image. Therefore, we generate three log-spectrograms by fast Fourier transform at three scales : window size of w i∈{0,1,2} = 2 (2×i) × 128 (i.e. 128, 512, 2048). This fast computation approximates a compressed multi-scale representation of voices and chirps of birds. We think it improves usual spectral representations that deal with either temporal or frequency resolution (see usual representation in <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b2">3]</ref>). Next, we reshape the three spectrograms by bilinear interpolation, into an optimal dimension for Inception inputs. In sum, our audio featuring is :</p><p>1. Resample the dataset to 22050 Hz sampling rate. 2. Let min duration be the accepted minimum duration of audio sample. 3. Let min subduration be the accepted minimum duration of the subimage. 4. Let d(x) be the duration of the audio sample x. 5. While d(x) &lt; min duration self concatenate x. 6. Compute three log-spectrogram S i (x) with window sizes w i ∈ (128, 512, 2048). 7. Remove outliers of the S i (x) distribution to avoid quantification error. 8. Resize (bilinear interpolation) S i (x) to an optimal format for Inception: height = 299 pixels for the frequency dimension, 299×min subduration/1.5 pixels for the time dimension. 9. Concatenate the three S i (x) into one 3 channels RGB multiscale image I.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Data augmentation</head><p>During the training stage, we run data augmentation. Therefore, we use standard transformations in computer vision. More precisely we run the Inception preprocessing on the spectrograms S i as random hue, contrast, brightness, and saturation, plus random crop in time and frequency, as follows :  3 Model specification</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Transfer learning from Inception-v4 to Soundception</head><p>Inception-v4 is the state-of-the-art in computer vision <ref type="bibr" target="#b0">[1]</ref>. There are several ways to adapt Inception-v4 to time-frequency analysis, as a simple average pooling on the time axis, or use recurrent layer at the top of the network or both. Here we adapt Inception-v4 to make it entirely convolutional on the time domain, the aim being to make it invariant to temporal translation, and to allow arbitrary widthsized image. Secondly we add a time and a time-frequency attention mechanisms into the branches as represented in the synopsis of Inception-v3 Fig <ref type="figure" target="#fig_3">3</ref>. We set dI c = 15 seconds, min d = 60 sec., scale = 1.5 sec. for 299 pixels in respect to baseline Inception inputs of 299 × 299 pixels, and the available VRAM per GPU (12 Go). We do not use specific bird detection in order to avoid handcrafted detection which could weaken the complete pipeline. The detection is processed by an attention mechanism as presented in the next section. It runs on a large time window (dI c = 15sec.) to increase the probability of bird activity in the image. The main process results into one time frequency RGB image (as Fig. <ref type="figure" target="#fig_0">1</ref>) per audio file, thus 36492 in BirdClef 2017.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Attention mechanisms in time and time-frequency</head><p>Attention mechanisms are gaining popularity. The goal is to focus attention somewhere or on something. We can learn detection from classification by adding a soft attention mechanism <ref type="bibr" target="#b1">[2]</ref>. Thus, we add to the Inception model two attention mechanisms : a temporal attention into the auxiliary branch, and a timefrequency attention in the main branch.</p><p>Each attention mechanism is an element wise product of the detector feature map with the feature maps of the previous Inception layers. These mechanisms learn how to pass the information and thus play the role of bird activity detectors.</p><p>The first attention mechanism is the temporal attention defined by a sigmoid activation because the bird temporal activities are expected to be binomial in time. The second attention mechanism is defined by the softmax of the outputs<ref type="foot" target="#foot_0">1</ref> , yielding to smooth neuron time-frequency activity distribution.</p><p>Next, based on the sigmoid or softmax of the outputs, we compute the element wise product between the feature maps of the signal and the feature maps of the detector.</p><p>In Fig. <ref type="figure" target="#fig_4">4</ref> we show how Soundception focuses in time on bird activities. In Fig. <ref type="figure" target="#fig_5">5</ref> we show that in time-frequency Soundception indeed focuses on the loudest formant/frequency of the bird call. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Training stage</head><p>The training of the model took several days depends of hyper-parameters, using randomly split training (90%) and (10%) validation sets as in <ref type="bibr" target="#b3">[4]</ref>. For the transfer learning, we first train only the top layer of Soundception, then we fine tune all layers. We split training in different stages with different batch sizes, according to the available GPU memory :</p><p>1. Train the model with time window of dI c = 15sec., 2. Train last layers and detectors with mini-batch size of 8, 3. Fine tune all layers with mini-batch size of 4.</p><p>There are different options to evaluate the prediction on the development set. We could consider the audio file transform into images I as previously described with arbitrary temporal size. Here, we optimize the model according to the average score of the predictions from the subimages I c of the main images I.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>Fig. <ref type="figure">6</ref>. Results of all the challengers in the BirdClef2017 international contest. The DYNI UTLN RUN1 is the best in three task categories : Soundscapes with time-code, and the two traditional classification tasks. It is third in the 'Soundscape without timecodes'. Complementary, the DYNI UTLN Run2 yields to lower scores in average but to the best score in the other task 'Soundscape without time-codes'.</p><p>We report Fig. <ref type="figure">6</ref> the official scores on the four tasks of each of the challenger. Our model Soundception wins this challenge in the four tasks. The DYNI UTLN RUN1 depicted in this paper is the best model in three of the tasks, with 0.714 Mean Average Precision (MAP) on the 1500 species 'traditional records' task, 0.616 MAP with 'background species' task, and 0.288 MAP on the 'Soundscapes with time-codes' task. It is third in the 'Soundscape without time-codes' task, for which our other run (DYNI UTLN Run2) which explored different parameters is first.</p><p>These results are good despite the fact that we had not time to completely train Soundception on different topologies and to develop associated preprocessings in the four weeks of the challenge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion and future work</head><p>In this paper we show how we transferred Inception-v4 from image to the acoustic domain, and how it learns bird sound detection by itself using attention models. The results show that it is possible to tackle state-of-the-art sound classification by the transfer learning of efficient pre-existing image classification model. This strategy can be useful to tackle other challenge without pre-segmentation.</p><p>We had not been able to let completely converge the training stage of Soundception due to the huge computation and GPU needs, however it reaches the best results in the BirdClef 2017 challenge. There is a lot to be done in this area. Our current work also explores different scalable optimizations to learn audio to image representations instead of pseudo multi-scale FFT spectrograms. We currently develop a model with stacked GRU at the top of the network.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. From Top to Bottom, S1, S2, S3, resp. 128, 512, and 2048 bins FFT window spectrograms of a bird call. The fourth at the Bottom overlaps the three above and shows their complementarity.</figDesc><graphic coords="2,152.06,282.96,311.25,54.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>1 .</head><label>1</label><figDesc>Random choice of an image I in the dataset. 2. Random crop a subimage I c from I: let hI c = initial height of I c = 299, let dI c = initial duration of I c = 15sec. ∼ 299 × 10, set random temporal dilatation factor of I c uniformly sampled in [0.95, 1.05], set random top of I c uniformly sampled in [0.96, 1] × hI c , set random bottom of I c uniformly sampled in [0, 0.01] × hI c , crop and resize I c from I with above parameters and random time offset. 3. Vision preprocessing of I c by hue, contrast, brightness, saturation variations. 4. Add random noise or process local brightness to I c .</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Sample of a multiscale representation of bird activities, before (Top) versus after (Bottom) data augmentation.</figDesc><graphic coords="3,152.06,404.26,311.24,54.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Architecture of Inception-v3 (credit: [8]), similar to Inception-v4 model, showing the place of the branch where is connected each of the attention mechanism.</figDesc><graphic coords="4,134.77,115.84,345.82,191.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Example of temporal attention on spectrogram (Top). The temporal attention is shown in white segments (Bottom).</figDesc><graphic coords="5,152.08,115.84,311.21,54.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. Example of time-frequency attention : the spectrogram image (Top), the timefrequency attention (Bottom), in grey scale (highest attention is white, lowest is black).</figDesc><graphic coords="5,152.08,426.99,311.21,54.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="6,134.77,281.94,345.82,218.84" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Normalization with softmax is: (exp nu)/ v exp nv.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgements. We thank Laura Bessone for her support in this paper. We thank XenoCanto, LifeClef team, EADM GDR CNRS MADICS, SABIOD.org and Amazon Explorama Lodges with Lucio Pando, P. Bucur and M. Trone for the co-organization of this challenge. This research is also supported by STIC-AmSud BRILAAM for South American bioacoustics. We thank TPM, CG83, UTLN for their support in the Captile project on soundscape analysis. We thank V. Roger for setting the GPU, and S. Paris for lending them. We are grateful to the anonymous reviewers. Antoine Sevilla has set up the experimentations and ideas in this article.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ioffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1602.07261</idno>
		<title level="m">Inception-v4, inception-resnet and the impact of residual connections on learning</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Show, attend &amp; tell: Neural image caption generation with visual attention</title>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zemel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>ICML</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Goeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planque</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<title level="m">LifeCLEF Bird Identification Task 2016: The arrival of Deep learning</title>
				<imprint>
			<publisher>CLEF</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Audio based bird species identification using deep learning techniques</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sprengel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Jaggi</surname></persName>
		</author>
		<author>
			<persName><surname>Hofmann</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>CLEF</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Goëau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Glotin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Spampinato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bonnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-P</forename><surname>Vellinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Lombardo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Planqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Palazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<title level="m">LifeCLEF 2017 Lab Overview: multimedia species identification challenges</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="www.xeno-canto.org" />
		<title level="m">Xeno Canto Foundation: Sharing bird sounds from around the world</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Abadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brevdo</surname></persName>
		</author>
		<ptr target="org" />
		<title level="m">TensorFlow: Large-scale machine learning on heterogeneous systems</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>tensorflow.</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
