<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Tone Transfer: In-Browser Interactive Neural Audio Synthesis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Michelle</forename><surname>Carney</surname></persName>
							<email>michellecarney@google.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Google Inc</orgName>
								<address>
									<addrLine>1600 Amphitheatre Pkwy</addrLine>
									<postCode>94043</postCode>
									<settlement>Mountain View</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chong</forename><surname>Li</surname></persName>
							<email>chongli@google.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Google Inc</orgName>
								<address>
									<addrLine>1600 Amphitheatre Pkwy</addrLine>
									<postCode>94043</postCode>
									<settlement>Mountain View</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Edwin</forename><surname>Toh</surname></persName>
							<email>edwintoh@google.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Google Inc</orgName>
								<address>
									<addrLine>1600 Amphitheatre Pkwy</addrLine>
									<postCode>94043</postCode>
									<settlement>Mountain View</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nida</forename><surname>Zada</surname></persName>
							<email>nzada@google.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Google Inc</orgName>
								<address>
									<addrLine>1600 Amphitheatre Pkwy</addrLine>
									<postCode>94043</postCode>
									<settlement>Mountain View</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ping</forename><surname>Yu</surname></persName>
							<email>piyu@google.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Google Inc</orgName>
								<address>
									<addrLine>1600 Amphitheatre Pkwy</addrLine>
									<postCode>94043</postCode>
									<settlement>Mountain View</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jesse</forename><surname>Engel</surname></persName>
							<email>jesseengel@google.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Google Inc</orgName>
								<address>
									<addrLine>1600 Amphitheatre Pkwy</addrLine>
									<postCode>94043</postCode>
									<settlement>Mountain View</settlement>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Tone Transfer: In-Browser Interactive Neural Audio Synthesis</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7CBF7063B837A4A82742A7F479EF15A8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>interactive machine learning</term>
					<term>dsp</term>
					<term>audio</term>
					<term>music</term>
					<term>vocoder</term>
					<term>synthesizer</term>
					<term>signal processing</term>
					<term>tensorflow</term>
					<term>autoencoder</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here, we demonstrate Tone Transfer, an interactive web experience that enables users to use neural networks to transform any audio input into an array of several different musical instruments. By implementing fast and efficient neural synthesis models in TensorFlow.js (TF.js), including special kernels for numerical stability, we are able to overcome the size and latency of typical neural audio synthesis models to create a real-time and interactive web experience. Finally, Tone Transfer was designed from extensive usability studies with both musicians and novices, focusing on enhancing creativity of users across a variety of skill levels.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Neural audio synthesis, generating audio with neural networks, can extend human creativity by creating new synthesis tools that are expressive and intuitive <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. However, most neural networks are too computationally expensive for interactive audio generation, especially on the web and mobile devices <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. Differentiable Digital Signal Processing (DDSP) models are a new class of algorithms that overcome these challenges by leveraging prior signal processing knowledge to make synthesis networks small, fast, and efficient <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>Tone Transfer is a musical experience powered by Magenta's open source DDSP library 1 to model and map between the characteristics of different musical instruments with machine learning. The process can lead to creative, quirky results. For example replacing a capella singing with a saxophone solo, or a dog barking with a trumpet performance.</p><p>Tone Transfer was created as an invitation to novices and musicians to take part in the future of machine learning and creativity. Our focus was on cultural inclusion, increased awareness of machine learning for artists and the general public, and inspiring excitement of the future of creative work among musicians. We</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">User Interface Design</head><p>We created the Tone Transfer website (https://sites. research.google/tonetransfer) to allow anyone to experiment with DDSP, regardless of their musical experience, on both desktop and mobile. Through multiple rounds of usability studies with musicians, we have been able to distill the following three main features in Tone Transfer:</p><p>• Play with curated music samples. To understand what DDSP can do, the user could click to listen to a wide range of pre-recorded samples and their machine learning transformations in other instruments.</p><p>• Record and transform new music. We also provided options for users to record or upload new sounds and transform them into four instruments in browser.</p><p>• Adjust the music. We know that control is important for the user so we allow the user to adjust the octave, loudness, and mixing of the machine learning transformations to get desired music output. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Models</head><p>At a technical level, the goal of our system is to be able to create a monophonic synthesizer that can take coarse user inputs of Pitch and Loudness and convert them into detailed synthesizer coefficients that produce realistic sounding outputs. We find this is possible with a carefully designed variant of the standard Autoencoder architecture, where we train the model to:</p><p>• Encode: Extract pitch and loudness signals from audio.</p><p>• Decode: Use a network to convert pitch and loundess into synthesizer controls.</p><p>• Synthesize: Use DDSP modules to convert synthesizer controls to audio.</p><p>We then compare the synthesized audio to the original audio with a multi-scale spectrogram loss <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b10">11]</ref> to train the parameters of the decoder network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Encoding Features</head><p>To extract pitch during training (fundamental frequency, 𝑓 0 ), we use a pretrained CREPE network <ref type="bibr" target="#b11">[12]</ref>. Dur-ing inference we use the SPICE model, which is faster and has an implementation available in TF.js (https: //tfhub.dev/google/tfjs-model/spice/2/default/1).</p><p>While the original DDSP paper used perceptually weighted spectrograms for loudness, we find that the root-mean-squared (RMS) power of the waveform works well as a proxy and is less expensive to compute. We train on 16kHz audio, with a hop size of 64 samples (4ms) and a forward-facing (non-centered) frame size of 1024 samples (64ms). We convert power to decibels, and scale pitch and power to the range [0, 1] before passing the features to the decoder.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Decoder Network</head><p>The decoder converts the encoded features (𝑓 0 , power) into synthesizer controls for each frame of audio (250Hz, 4ms). As we discuss in Section 3.3, for the DDSP models in this work, the synthesizer controls are the harmonic amplitude (𝐴), harmonic distribution (𝑐 𝑘 ), and filtered noise magnitudes.</p><p>The DDSP modules are agnostic to the model architecture used and convert model outputs to desired control ranges using custom nonlinearities as described in <ref type="bibr" target="#b7">[8]</ref>.</p><p>We use two stacks of non-causal dilated convolution layers as the decoder. Each stack begins with a nondilated input convolution layer, followed by 8 layers, with a dilation factor increasing in powers of 2 from 1 to 128. Each layer has 128 channels and a kernel size of 3, and is followed by layer normalization <ref type="bibr" target="#b12">[13]</ref>, and a ReLU nonlinearity <ref type="bibr" target="#b13">[14]</ref>. The scale and shift of the layer normalization are controlled by the pitch and power conditioning after it is run through a 1x1 convolution with 128 channels. The complete model has ∼ 830𝑘 trainable parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Differentiable Synthesizers</head><p>To generate audio, we use a combination of additive (Harmonic) and subtractive (Filtered Noise) synthesis techniques. Inspired by the work of <ref type="bibr" target="#b14">[15]</ref>, we model sound as a flexible combination of time-dependent sinusoidal oscillators and filtered noise. DDSP makes these operations differentiable for end-to-end training by implementing them in TensorFlow <ref type="bibr" target="#b15">[16]</ref>. Full details can be found in the original papers <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>, but for clarity, we review the main modules here.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">Sinusoidal Oscillators</head><p>A sinusoidal oscillator bank is an additive synthesizer that consists of 𝐾 sinusoids with individually varying amplitudes 𝐴 𝑘 and frequencies 𝑓 𝑘 . These are flexibly specified by the output of a neural network over 𝑛 discrete time steps (250Hz, 4ms per frame):</p><formula xml:id="formula_0">𝑥(𝑛) = 𝐾 −1 ∑ 𝑘=0 𝐴 𝑘 (𝑛) sin(𝜙 𝑘 (𝑛)),<label>(1)</label></formula><p>where 𝜙 𝑘 (𝑛) is its instantaneous phase obtained by cumulative summation of the instantaneous frequency 𝑓 𝑘 (𝑛):</p><formula xml:id="formula_1">𝜙 𝑘 (𝑛) = 2𝜋 𝑛 ∑ 𝑚=0 𝑓 𝑘 (𝑚),<label>(2)</label></formula><p>The network outputs amplitudes 𝐴 𝑘 and frequencies 𝑓 𝑘 every 4ms, which are upsampled to audio rate (16kHz) using overlapping Hann windows and linear interpolation respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2.">Harmonic Synthesizer</head><p>Since we train on individual instruments with strong harmonic relationships of their partials, we can reparameterize the sinusoidal oscillator bank as a harmonic oscillator, with a single fundamental frequency 𝑓 0 , amplitude 𝐴, and harmonic distribution 𝑐 𝑘 . All the output frequencies are constrained to be harmonic (integer) multiples of a fundamental frequency (pitch),</p><formula xml:id="formula_2">𝑓 𝑘 (𝑛) = 𝑘𝑓 0 (𝑛)<label>(3)</label></formula><p>Individual amplitudes are deterministically retrieved by multiplying the total amplitude, 𝐴(𝑛), with the normalized distribution over harmonic amplitudes, 𝑐 𝑘 (𝑛): 𝐴 𝑘 (𝑛) = 𝐴(𝑛)𝑐 𝑘 (𝑛).</p><p>(4)</p><p>where ,</p><formula xml:id="formula_3">𝐾 −1 ∑ 𝑘=0 𝑐 𝑘 (𝑛) = 1, 𝑐 𝑘 (𝑛) ≥ 0<label>(5)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3.">Filtered Noise Synthesizer</head><p>We can model the non-periodic audio components as a subtractive synthesizer, with a linear time-varying filtered noise source. White noise is generated from a uniform distribution, which we then filter with an Finite Impulse Response (FIR) filter. Since the network outputs different coefficients of the frequency response in each frame, it creates an expressive time-varying filter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.4.">Reverb</head><p>To first approximation, room responses with fixed source and listener locations can be approximated by a single impulse response that can be applied as a FIR filter. In terms of neural networks, this is equivalent to a 1-D convolution with a very large receptive field (∼40k).</p><p>We treat the impulse response as a learned variable, and train a new response (jointly with the rest of the model) for each dataset with a unique recording environment.</p><p>To better disentangle the signal from the room response, we generate the impulse response with a filtered noise synthesizer as described in Section 3.3.3, and learn the transfer function coefficients to generate a desired impulse response. This prevents coherent impulse responses at short time scales that can entangle the frequency response of the synthesizer with the room response. At inference, we discard the expensive convolutional reverb component to synthesize the "dry" signal, and apply a more efficient stock reverb effect.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Training</head><p>Given that the DDSP model described above is for monophonic instruments, we collect data of individual instruments, and train a separate model for each dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1.">Data</head><p>We train models on four instruments: Violin, Flute, Trumpet, and Saxophone. Following <ref type="bibr" target="#b16">[17]</ref> and <ref type="bibr" target="#b7">[8]</ref>, we use home recordings of Trumpet and Saxophone for training, and collected performances of Flute and Violin from the MusOpen royalty free music library 2 .</p><p>Since DDSP models are efficient to train, for each instrument we only need to collect between 10 and 15 minutes of performance, and we ensure a that all recordings are from the same room environment to allow training a single reverb impulse response.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Optimization</head><p>We train models with the Adam optimizer <ref type="bibr" target="#b17">[18]</ref>, examples 4 seconds in length, batch size of 128, and learning rate of 3e-4. As we would like to use models to generalize to new types of pitch and loudness inputs, we reduce overfitting through early stopping, typically between 20k and 40k iterations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Interactive Models</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">On-device Inference with Magenta.js</head><p>Musical interaction has strong requirements for close to real-time feedback and low latency. However, machine learning models are typically slow and computationally expensive, requiring GPU or TPU servers to run at all. Further, large model sizes lead to long load times before execution can even begin. Running models on-device, if possible, eliminates serving costs, decreases interactive latency, and increases accessibility. To create an interactive and scalable musical experience, we optimized and converted models to be compatible with Tensorflow.js so that they can run ondevice in the browser on both desktop and mobile devices.</p><p>Even after optimization, the models are still relatively large (4mb each), so each model is only loaded on demand. This ensured the user downloads only the things they need, and nothing more, which resulted in a fast and responsive website.</p><p>The methods to extract pitches, and the four models that are on the website are then open sourced and These methods are added to the Magenta.js library.<ref type="foot" target="#foot_0">3</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Custom TF.JS Kernels to Preserve Precision</head><p>TensorFlow.js is a web ML platform that provides hardware acceleration through web APIs like WebGL and WebAssembly. DDSP relies on TensorFlow.js to speed up the model execution. To maintain accuracy of DDSP model on a variety of devices, we implemented a couple of special kernels that eliminated overflow (𝑎𝑏𝑠(𝑛) &gt; 65504) and underflow (𝑎𝑏𝑠(𝑛) &lt; 2 −10 ) of float16 texture when running on the TensorFlow.js WebGL backend.</p><p>For example, the DDSP model uses TensorFlow Cumsum op to calculate the cumulative summation of the instantaneous frequency, then obtain the phase from those values. TensorFlow.js implements a parallel algorithm <ref type="foot" target="#foot_1">4</ref> for cumulative sum, which requires log(n) writes of intermediate tensors to the GPU textures. The cumulative precision loss would cause a large shift on the final phase values. The solution is to register a custom Cumsum op that uses a serialized algorithm that avoids all intermediate texture writes and is incorporated with the phase computation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and Future Work</head><p>Tone Transfer is an example of an interdisciplinary design, engineering, and AI research teams working together to create a User Interface Design for the next wave of AI. We leverage state-of-the-art machine learning models that are both expressive and efficient, and optimize them for client-side use to enable interactive neural audio synthesis on the web. This work demonstrates that on-device machine learning can enable interactive and creative music making experiences for novices and musicians alike. The technologies that power Tone Transfer have also been open sourced as a part of Magenta.js and provide a solid foundation for further interactive studies. Future work will hopefully allow users to train their own models based on their own instruments, and explore using new types of inputs to create multi-sensory experiences.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The web user interface of Tone Transfer</figDesc><graphic coords="2,117.64,70.16,360.00,225.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A diagram of the DDSP autoencoder training. Source audio is encoded to a 2-dimensional input feature (pitch and power), that the decoder converts to a 126-dimensional synthesizer controls (amplitude, harmonic distribution, and noise frequency response). We use the CREPE model for pitch detection during training and the SPICE model for pitch detection during inference. These controls are synthesized by a filtered noise synthesizer and harmonic synthesizer, mixed together, and run through a trainable reverb module. The resulting audio is compared against the original audio with a multi-scale spectrogram loss. Blue components represent the source audio and resynthesized audio. Yellow components are fixed components (pitch tracking, DDSP synthesizers, and loss function), green components are intermediate features (decoder inputs and synthesizer controls), and red components have trainable parameters (decoder layers and reverb impulse response).</figDesc><graphic coords="3,74.41,70.16,473.60,183.36" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>2</head><label></label><figDesc>Violin: Five pieces by John Garner (II. Double, III. Corrente, IV. Double Presto, VI. Double, VIII. Double, Flute: Four pieces by Paolo Damoro (24 Etudes for Flute, Op. 15 -III. Allegro con brio in G major, 24 Etudes for Flute, Op. 15 -VI. Moderato in B minor, 3 Fantaisies for Solo Flute, Op. 38 -Fantaisie no. 1, Sonata Appassionata, Op. 140)) from https://musopen.org/music/ 13574-violin-partita-no-1-bwv-1002/ made easier for anyone to download and run their own experiences. Each model comes with a set of custom values that are manually tweaked to create a more accurate output.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://github.com/magenta/magenta-js/tree/master/music# ddsp</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://en.wikipedia.org/wiki/Prefix_sum#Parallel_ algorithms</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We would like to acknowledge the contributions of everyone who made Tone Transfer possible, including Lamtharn (Hanoi) Hantrakul, Doug Eck, Nida Zada, Mark Bowers, Katie Toothman, Edwin Toh, Justin Secor, Michelle Carney, and Chong Li, and many others at Google. Thank you.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V D</forename><surname>Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Senior</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1609.03499</idno>
		<title level="m">Wavenet: A generative model for raw audio</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Neural audio synthesis of musical notes with WaveNet autoencoders</title>
		<author>
			<persName><forename type="first">J</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Resnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Norouzi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>ICML</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">GANSynth: Adversarial neural audio synthesis</title>
		<author>
			<persName><forename type="first">J</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gulrajani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Donahue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Mor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polyak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Taigman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1805.07848</idno>
		<title level="m">A universal music translation network</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Elsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Noury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Casagrande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lockhart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Stimberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V D</forename><surname>Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.08435</idno>
		<title level="m">Efficient neural audio synthesis</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Fast and flexible neural audio synthesis</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">H</forename><surname>Hantrakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>ISMIR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Parallel wavenet: Fast high-fidelity speech synthesis</title>
		<author>
			<persName><forename type="first">A</forename><surname>Oord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Babuschkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Driessche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lockhart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cobo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Stimberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="3918" to="3926" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Ddsp: Differentiable digital signal processing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">H</forename><surname>Hantrakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ternational Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Self-supervised pitch detection by inverse audio synthesis</title>
		<author>
			<persName><forename type="first">J</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Swavely</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">H</forename><surname>Hantrakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hawthorne</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Payne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.00341</idno>
		<title level="m">Jukebox: A generative model for music</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Takaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yamagishi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.12088</idno>
		<title level="m">Neural source-filter waveform models for statistical parametric speech synthesis</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Crepe: A convolutional representation for pitch estimation</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Salamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Bello</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018. 2018</date>
			<biblScope unit="page" from="161" to="165" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.06450</idno>
		<title level="m">Layer normalization</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Rectified linear units improve restricted boltzmann machines</title>
		<author>
			<persName><forename type="first">V</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>ICML</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition</title>
		<author>
			<persName><forename type="first">X</forename><surname>Serra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Music Journal</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="12" to="24" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Abadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brevdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Citro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Devin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghemawat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Harp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Irving</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Isard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jozefowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kudlur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Levenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mané</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Monga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Murray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Steiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Talwar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tucker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vasudevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Viégas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Warden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wattenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wicke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zheng</surname></persName>
		</author>
		<ptr target="https://www.tensorflow.org/,softwareavailablefromtensorflow.org" />
		<title level="m">TensorFlow: Largescale machine learning on heterogeneous systems</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Scouts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Magenta</surname></persName>
		</author>
		<ptr target="https://sites.research.google/tonetransfer" />
		<title level="m">Tonetransfer</title>
				<imprint>
			<date type="published" when="2020-12-10">2020. 2020-12-10</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<title level="m">Adam: A method for stochastic optimization</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
