=Paper=
{{Paper
|id=Vol-2903/IUI21WS-HAIGEN-3
|storemode=property
|title=Tone Transfer: In-Browser Interactive Neural Audio Synthesis
|pdfUrl=https://ceur-ws.org/Vol-2903/IUI21WS-HAIGEN-3.pdf
|volume=Vol-2903
|authors=Michelle Carney,Chong Li,Edwin Toh,Nida Zada,Ping Yu,Jesse Engel
|dblpUrl=https://dblp.org/rec/conf/iui/CarneyLTZYE21
}}
==Tone Transfer: In-Browser Interactive Neural Audio Synthesis==
Michelle Carney, Chong Li, Edwin Toh, Nida Zada, Ping Yu and Jesse Engel
Google Inc., 1600 Amphitheatre Pkwy, Mountain View, CA, 94043, USA
Abstract
Here, we demonstrate Tone Transfer, an interactive web experience that lets users apply neural networks to transform any audio input into any of several different musical instruments. By implementing fast and efficient neural synthesis models in TensorFlow.js (TF.js), including special kernels for numerical stability, we are able to overcome the size and latency constraints of typical neural audio synthesis models and create a real-time, interactive web experience. Finally, Tone Transfer was designed through extensive usability studies with both musicians and novices, focusing on enhancing the creativity of users across a variety of skill levels.
Keywords
interactive machine learning, dsp, audio, music, vocoder, synthesizer, signal processing, tensorflow, autoencoder
Joint Proceedings of the ACM IUI 2021 Workshops, April 13-17, 2021, College Station, USA
email: michellecarney@google.com (M. Carney); chongli@google.com (C. Li); edwintoh@google.com (E. Toh); nzada@google.com (N. Zada); piyu@google.com (P. Yu); jesseengel@google.com (J. Engel)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Neural audio synthesis, generating audio with neural networks, can extend human creativity by creating new synthesis tools that are expressive and intuitive [1, 2, 3, 4]. However, most neural networks are too computationally expensive for interactive audio generation, especially on the web and on mobile devices [5, 6, 7]. Differentiable Digital Signal Processing (DDSP) models are a new class of algorithms that overcome these challenges by leveraging prior signal processing knowledge to make synthesis networks small, fast, and efficient [8, 9].

Tone Transfer is a musical experience powered by Magenta's open source DDSP library (g.co/magenta/ddsp) to model and map between the characteristics of different musical instruments with machine learning. The process can lead to creative, quirky results: for example, replacing a cappella singing with a saxophone solo, or a dog barking with a trumpet performance.

Tone Transfer was created as an invitation to novices and musicians alike to take part in the future of machine learning and creativity. Our focus was on cultural inclusion, increased awareness of machine learning for artists and the general public, and inspiring excitement about the future of creative work among musicians. We did this through an interactive, in-browser creative experience.

2. User Interface Design

We created the Tone Transfer website (https://sites.research.google/tonetransfer) to allow anyone to experiment with DDSP, regardless of their musical experience, on both desktop and mobile. Through multiple rounds of usability studies with musicians, we distilled the following three main features of Tone Transfer:

• Play with curated music samples. To understand what DDSP can do, the user can click to listen to a wide range of pre-recorded samples and their machine learning transformations into other instruments.

• Record and transform new music. We also provide options for users to record or upload new sounds and transform them into four instruments in the browser.

• Adjust the music. We know that control is important for the user, so we allow the user to adjust the octave, loudness, and mixing of the machine learning transformations to get the desired music output.

There is also a need to help the user understand how to use Tone Transfer, as well as the machine learning technology behind it. We therefore designed tips that guide the user through the experience and educate them on the best ways to interact with it. The user can also learn about the training process of machine learning models by clicking the "Discover more" button.
Figure 1: The web user interface of Tone Transfer
3. Models

At a technical level, the goal of our system is to create a monophonic synthesizer that can take coarse user inputs of pitch and loudness and convert them into detailed synthesizer coefficients that produce realistic sounding outputs.

We find this is possible with a carefully designed variant of the standard autoencoder architecture, where we train the model to:

• Encode: Extract pitch and loudness signals from audio.

• Decode: Use a network to convert pitch and loudness into synthesizer controls.

• Synthesize: Use DDSP modules to convert synthesizer controls to audio.

We then compare the synthesized audio to the original audio with a multi-scale spectrogram loss [10, 8, 11] to train the parameters of the decoder network.
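To make the objective concrete, here is a minimal sketch of such a multi-scale spectrogram loss in TensorFlow, in the spirit of [10, 8, 11]. The FFT sizes, the 75% overlap, and the extra log-magnitude term are common choices in DDSP-style implementations, not necessarily the exact configuration used by Tone Transfer.

```python
import tensorflow as tf

def multiscale_spectrogram_loss(audio, audio_synth,
                                fft_sizes=(2048, 1024, 512, 256, 128, 64)):
    """L1 distance between magnitude spectrograms at several resolutions.

    Comparing at multiple FFT sizes trades off time vs. frequency
    resolution, so the loss is sensitive to both fast transients and
    fine-grained pitch content.
    """
    loss = 0.0
    for size in fft_sizes:
        def mag(x):
            # Magnitude STFT with 75% frame overlap.
            return tf.abs(tf.signal.stft(
                x, frame_length=size, frame_step=size // 4, pad_end=True))
        m, m_hat = mag(audio), mag(audio_synth)
        # Compare both linear and log magnitudes, so the loss attends
        # to loud harmonics and quiet noise components alike.
        loss += tf.reduce_mean(tf.abs(m - m_hat))
        loss += tf.reduce_mean(tf.abs(tf.math.log(m + 1e-6) -
                                      tf.math.log(m_hat + 1e-6)))
    return loss
```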
3.1. Encoding Features

To extract pitch (fundamental frequency, 𝑓0) during training, we use a pretrained CREPE network [12]. During inference we use the SPICE model, which is faster and has an implementation available in TF.js (https://tfhub.dev/google/tfjs-model/spice/2/default/1).

While the original DDSP paper used perceptually weighted spectrograms for loudness, we find that the root-mean-squared (RMS) power of the waveform works well as a proxy and is less expensive to compute. We train on 16kHz audio, with a hop size of 64 samples (4ms) and a forward-facing (non-centered) frame size of 1024 samples (64ms). We convert power to decibels, and scale pitch and power to the range [0, 1] before passing the features to the decoder.
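As a concrete illustration of the loudness feature, a minimal numpy sketch follows. Only the 16kHz rate, 64-sample hop, and forward-facing 1024-sample frame come from the text above; the [-80, 0] dB normalization range is our assumption.

```python
import numpy as np

SAMPLE_RATE = 16000
HOP = 64      # 4 ms hop
FRAME = 1024  # 64 ms; forward-facing: frame i covers [i*HOP, i*HOP + FRAME)

def power_feature(audio, db_min=-80.0, db_max=0.0):
    """Frame-wise RMS power in decibels, scaled to [0, 1]."""
    n_frames = 1 + (len(audio) - FRAME) // HOP
    # Index matrix gathering each (non-centered) frame of samples.
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n_frames)[:, None]
    rms = np.sqrt(np.mean(audio[idx] ** 2, axis=1) + 1e-10)
    db = 20.0 * np.log10(rms + 1e-10)
    return np.clip((db - db_min) / (db_max - db_min), 0.0, 1.0)
```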
3.2. Decoder Network

The decoder converts the encoded features (𝑓0, power) into synthesizer controls for each frame of audio (250Hz, 4ms). As we discuss in Section 3.3, for the DDSP models in this work, the synthesizer controls are the harmonic amplitude (𝐴), the harmonic distribution (𝑐𝑘), and the filtered noise magnitudes. The DDSP modules are agnostic to the model architecture used, and convert model outputs to the desired control ranges using custom nonlinearities as described in [8].
Figure 2: A diagram of the DDSP autoencoder training. Source audio is encoded to a 2-dimensional input feature (pitch and power), which the decoder converts to 126-dimensional synthesizer controls (amplitude, harmonic distribution, and noise frequency response). We use the CREPE model for pitch detection during training and the SPICE model for pitch detection during inference. These controls are synthesized by a filtered noise synthesizer and harmonic synthesizer, mixed together, and run through a trainable reverb module. The resulting audio is compared against the original audio with a multi-scale spectrogram loss. Blue components represent the source audio and resynthesized audio. Yellow components are fixed components (pitch tracking, DDSP synthesizers, and loss function), green components are intermediate features (decoder inputs and synthesizer controls), and red components have trainable parameters (decoder layers and reverb impulse response).
We use two stacks of non-causal dilated convolution layers as the decoder. Each stack begins with a non-dilated input convolution layer, followed by 8 layers with a dilation factor increasing in powers of 2 from 1 to 128. Each layer has 128 channels and a kernel size of 3, and is followed by layer normalization [13] and a ReLU nonlinearity [14]. The scale and shift of the layer normalization are controlled by the pitch and power conditioning after it is run through a 1x1 convolution with 128 channels. The complete model has ∼830k trainable parameters.
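A hedged sketch of this conditioned decoder in Keras follows. The exact wiring of the conditioning and the output head are our reading of the text; in particular, the 126-way control split (1 amplitude + 60 harmonics + 65 noise magnitudes) is inferred from Figure 2 rather than documented.

```python
import tensorflow as tf

class CondLayerNorm(tf.keras.layers.Layer):
    """Layer norm whose scale and shift are produced from the conditioning."""

    def __init__(self, channels=128):
        super().__init__()
        self.norm = tf.keras.layers.LayerNormalization(center=False, scale=False)
        self.to_scale_shift = tf.keras.layers.Dense(2 * channels)

    def call(self, x, cond):
        scale, shift = tf.split(self.to_scale_shift(cond), 2, axis=-1)
        return self.norm(x) * scale + shift

def build_decoder(n_frames, channels=128, n_controls=126):
    features = tf.keras.Input(shape=(n_frames, 2))  # (pitch, power) in [0, 1]
    # 1x1 convolution of the conditioning, as described in the text.
    cond = tf.keras.layers.Conv1D(channels, 1)(features)
    x = features
    for _ in range(2):  # two stacks
        # Non-dilated input convolution layer.
        x = tf.keras.layers.Conv1D(channels, 3, padding='same')(x)
        for dilation in [1, 2, 4, 8, 16, 32, 64, 128]:
            y = tf.keras.layers.Conv1D(channels, 3, padding='same',
                                       dilation_rate=dilation)(x)
            y = CondLayerNorm(channels)(y, cond)
            x = tf.keras.layers.ReLU()(y)
    # Project to per-frame synthesizer controls (126-way split assumed).
    controls = tf.keras.layers.Conv1D(n_controls, 1)(x)
    return tf.keras.Model(features, controls)
```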
3.3. Differentiable Synthesizers

To generate audio, we use a combination of additive (Harmonic) and subtractive (Filtered Noise) synthesis techniques. Inspired by the work of [15], we model sound as a flexible combination of time-dependent sinusoidal oscillators and filtered noise. DDSP makes these operations differentiable for end-to-end training by implementing them in TensorFlow [16]. Full details can be found in the original papers [8, 9], but for clarity, we review the main modules here.

3.3.1. Sinusoidal Oscillators

A sinusoidal oscillator bank is an additive synthesizer that consists of 𝐾 sinusoids with individually varying amplitudes 𝐴𝑘 and frequencies 𝑓𝑘. These are flexibly specified by the output of a neural network over 𝑛 discrete time steps (250Hz, 4ms per frame):

$x(n) = \sum_{k=0}^{K-1} A_k(n) \sin(\phi_k(n))$   (1)

where 𝜙𝑘(𝑛) is the instantaneous phase of the 𝑘-th sinusoid, obtained by cumulative summation of its instantaneous frequency 𝑓𝑘(𝑛):

$\phi_k(n) = 2\pi \sum_{m=0}^{n} f_k(m)$   (2)

The network outputs amplitudes 𝐴𝑘 and frequencies 𝑓𝑘 every 4ms, which are upsampled to audio rate (16kHz) using overlapping Hann windows and linear interpolation, respectively.
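In code, Equations (1) and (2) reduce to a cumulative sum followed by a sine. A minimal numpy sketch, assuming the controls have already been upsampled from the 250Hz frame rate to audio rate (the Hann-window and linear-interpolation upsampling is omitted here):

```python
import numpy as np

def oscillator_bank(frequencies, amplitudes, sample_rate=16000):
    """Additive synthesis from per-sample controls.

    frequencies, amplitudes: [n_samples, K] arrays at audio rate.
    """
    omega = 2.0 * np.pi * frequencies / sample_rate     # radians per sample
    phases = np.cumsum(omega, axis=0)                   # Equation (2)
    return np.sum(amplitudes * np.sin(phases), axis=1)  # Equation (1)
```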
3.3.2. Harmonic Synthesizer

Since we train on individual instruments whose partials have strong harmonic relationships, we can reparameterize the sinusoidal oscillator bank as a harmonic oscillator with a single fundamental frequency 𝑓0, amplitude 𝐴, and harmonic distribution 𝑐𝑘. All the output frequencies are constrained to be harmonic (integer) multiples of the fundamental frequency (pitch):

$f_k(n) = k f_0(n)$   (3)

Individual amplitudes are deterministically retrieved by multiplying the total amplitude, 𝐴(𝑛), with the normalized distribution over harmonic amplitudes, 𝑐𝑘(𝑛):

$A_k(n) = A(n) c_k(n)$   (4)

where

$\sum_{k=0}^{K-1} c_k(n) = 1, \quad c_k(n) \geq 0$   (5)
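A small numpy sketch of this reparameterization follows. The softmax is one convenient way to satisfy Equation (5) and stands in for the custom output nonlinearities of [8]; we also index harmonics from 1 to avoid a DC component.

```python
import numpy as np

def harmonic_controls(f0, amplitude, harmonic_logits):
    """Convert decoder outputs to per-harmonic frequencies and amplitudes.

    f0, amplitude: [n_frames, 1]; harmonic_logits: [n_frames, K].
    """
    k = np.arange(1, harmonic_logits.shape[1] + 1)
    frequencies = f0 * k                         # Equation (3)
    # Softmax yields a non-negative distribution summing to 1 (Equation (5)).
    z = np.exp(harmonic_logits - harmonic_logits.max(axis=1, keepdims=True))
    c = z / z.sum(axis=1, keepdims=True)
    amplitudes = amplitude * c                   # Equation (4)
    return frequencies, amplitudes
```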
3.3.3. Filtered Noise Synthesizer

We can model the non-periodic audio components with a subtractive synthesizer: a linear time-varying filtered noise source. White noise is generated from a uniform distribution, which we then filter with a Finite Impulse Response (FIR) filter. Since the network outputs different coefficients of the frequency response in each frame, it creates an expressive time-varying filter.
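A rough per-frame numpy sketch of this idea, using frequency sampling to build the FIR filter. The window choice and lengths are illustrative; the DDSP library performs the filtering in the frequency domain with overlap-add.

```python
import numpy as np

def filtered_noise_frame(magnitudes, frame_size=64):
    """Synthesize one 4ms hop of filtered noise.

    magnitudes: sampled frequency-response magnitudes for this frame
    (e.g., 65 bins). Frequency sampling: inverse-FFT the response,
    center it, and window it to get a compact linear-phase FIR filter.
    """
    impulse = np.fft.irfft(magnitudes)                  # zero-phase IR
    impulse = np.fft.fftshift(impulse) * np.hanning(len(impulse))
    noise = np.random.uniform(-1.0, 1.0, frame_size)    # white noise source
    # Full convolution; consecutive frames are overlap-added in practice.
    return np.convolve(noise, impulse)
```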
3.3.4. Reverb

To a first approximation, room responses with fixed source and listener locations can be modeled by a single impulse response applied as an FIR filter. In terms of neural networks, this is equivalent to a 1-D convolution with a very large receptive field (∼40k samples). We treat the impulse response as a learned variable, and train a new response (jointly with the rest of the model) for each dataset with a unique recording environment.

To better disentangle the signal from the room response, we generate the impulse response with a filtered noise synthesizer as described in Section 3.3.3, and learn the transfer function coefficients that generate the desired impulse response. This prevents coherent impulse responses at short time scales, which can entangle the frequency response of the synthesizer with the room response. At inference, we discard the expensive convolutional reverb component to synthesize the "dry" signal, and apply a more efficient stock reverb effect.
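For illustration, here is a sketch of the naive variant described in the first paragraph, with the impulse response learned directly as a trainable variable. A direct 40k-tap convolution is expensive, which is exactly why the dry signal and stock reverb are used at inference; a practical implementation would also use FFT-based convolution, and Tone Transfer generates the impulse response from a filtered noise synthesizer instead of learning the taps directly.

```python
import tensorflow as tf

class LearnedReverb(tf.keras.layers.Layer):
    """Room response as a single learned FIR impulse response (~40k taps)."""

    def __init__(self, ir_length=40000):
        super().__init__()
        # Trained jointly with the rest of the model, one per dataset.
        self.ir = self.add_weight(
            name='impulse_response', shape=[ir_length],
            initializer=tf.random_normal_initializer(stddev=1e-6))

    def call(self, dry):
        # Equivalent to a 1-D convolution with a very large receptive field.
        x = tf.reshape(dry, [1, -1, 1])
        kernel = tf.reshape(self.ir, [-1, 1, 1])
        wet = tf.nn.conv1d(x, kernel, stride=1, padding='SAME')
        return tf.reshape(wet, [-1])
```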
3.4. Training

Given that the DDSP model described above is for monophonic instruments, we collect data of individual instruments, and train a separate model for each dataset.
3.4.1. Data

We train models on four instruments: Violin, Flute, Trumpet, and Saxophone. Following [17] and [8], we use home recordings of Trumpet and Saxophone for training, and collected performances of Flute and Violin from the MusOpen royalty free music library (Violin: five pieces by John Garner (II. Double, III. Corrente, IV. Double Presto, VI. Double, VIII. Double); Flute: four pieces by Paolo Damoro (24 Etudes for Flute, Op. 15 - III. Allegro con brio in G major, 24 Etudes for Flute, Op. 15 - VI. Moderato in B minor, 3 Fantaisies for Solo Flute, Op. 38 - Fantaisie no. 1, Sonata Appassionata, Op. 140); https://musopen.org/music/13574-violin-partita-no-1-bwv-1002/).

Since DDSP models are efficient to train, for each instrument we only need to collect between 10 and 15 minutes of performance, and we ensure that all recordings are from the same room environment to allow training a single reverb impulse response.

3.4.2. Optimization

We train models with the Adam optimizer [18], examples 4 seconds in length, a batch size of 128, and a learning rate of 3e-4. As we would like the models to generalize to new types of pitch and loudness inputs, we reduce overfitting through early stopping, typically between 20k and 40k iterations.
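Putting the training recipe together, a hedged sketch of the loop, reusing the loss sketch from Section 3; `build_model` and `make_dataset` are hypothetical stand-ins for the real pipeline:

```python
import tensorflow as tf

model = build_model()                        # hypothetical model constructor
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)
dataset = make_dataset().batch(128)          # hypothetical 4-second examples

for step, batch in enumerate(dataset):
    with tf.GradientTape() as tape:
        audio_synth = model(batch['features'])
        loss = multiscale_spectrogram_loss(batch['audio'], audio_synth)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Early stopping between ~20k and 40k steps curbs overfitting and
    # helps the model generalize to unseen pitch and loudness inputs.
    if step >= 40_000:
        break
```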
4. Interactive Models

4.1. On-device Inference with Magenta.js

Musical interaction has strong requirements for close to real-time feedback and low latency. However, machine learning models are typically slow and computationally expensive, often requiring GPU or TPU servers to run at all. Further, large model sizes lead to long load times before execution can even begin. Running models on-device, when possible, eliminates serving costs, decreases interactive latency, and increases accessibility. To create an interactive and scalable musical experience, we optimized and converted models to be compatible with TensorFlow.js so that they can run on-device in the browser on both desktop and mobile devices.

Even after optimization, the models are still relatively large (4MB each), so each model is only loaded on demand. This ensured the user downloads only the things they need, and nothing more, which resulted in a fast and responsive website.

The methods to extract pitches, and the four models that are on the website, are open sourced, making it easier for anyone to download them and build their own experiences. Each model comes with a set of custom values that are manually tweaked to create a more accurate output. These methods are added to the Magenta.js library (https://github.com/magenta/magenta-js/tree/master/music#ddsp).
4.2. Custom TF.js Kernels to Preserve Precision

TensorFlow.js is a web ML platform that provides hardware acceleration through web APIs like WebGL and WebAssembly. DDSP relies on TensorFlow.js to speed up model execution. To maintain the accuracy of the DDSP model on a variety of devices, we implemented a couple of special kernels that eliminate overflow (abs(x) > 65504) and underflow (abs(x) < 2⁻¹⁰) of float16 textures when running on the TensorFlow.js WebGL backend.

For example, the DDSP model uses the TensorFlow Cumsum op to calculate the cumulative summation of the instantaneous frequency, and then obtains the phase from those values. TensorFlow.js implements a parallel algorithm for cumulative sum (https://en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithms), which requires log(n) writes of intermediate tensors to the GPU textures. The cumulative precision loss would cause a large shift in the final phase values. The solution is to register a custom Cumsum op that uses a serialized algorithm, avoiding all intermediate texture writes, and is incorporated with the phase computation.
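The failure mode is easy to reproduce outside the browser. The numpy snippet below simulates float16 accumulation of the per-sample phase step of a 440Hz tone; it is an illustration of the precision loss, not the TF.js kernel itself:

```python
import numpy as np

sr, f0 = 16000, 440.0
step = 2 * np.pi * f0 / sr                   # ~0.173 rad per sample

exact = np.cumsum(np.full(sr, step, dtype=np.float64))
# A parallel scan stores partial sums in float16 textures. Once the
# running phase passes ~512 rad, float16 spacing is 0.5 rad, so adding
# 0.173 rad rounds away entirely and the phase stops advancing.
fp16 = np.cumsum(np.full(sr, step, dtype=np.float16), dtype=np.float16)

print(f"phase drift after 1 s: {abs(exact[-1] - float(fp16[-1])):.0f} rad")
```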
5. Conclusion and Future Work

Tone Transfer is an example of interdisciplinary design, engineering, and AI research teams working together to create a user interface design for the next wave of AI. We leverage state-of-the-art machine learning models that are both expressive and efficient, and optimize them for client-side use to enable interactive neural audio synthesis on the web. This work demonstrates that on-device machine learning can enable interactive and creative music making experiences for novices and musicians alike. The technologies that power Tone Transfer have also been open sourced as part of Magenta.js and provide a solid foundation for further interactive studies. Future work will hopefully allow users to train their own models based on their own instruments, and explore using new types of inputs to create multi-sensory experiences.
Acknowledgments

We would like to acknowledge the contributions of everyone who made Tone Transfer possible, including Lamtharn (Hanoi) Hantrakul, Doug Eck, Nida Zada, Mark Bowers, Katie Toothman, Edwin Toh, Justin Secor, Michelle Carney, and Chong Li, and many others at Google. Thank you.

References

[1] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: A generative model for raw audio, arXiv preprint arXiv:1609.03499 (2016).
[2] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, M. Norouzi, Neural audio synthesis of musical notes with WaveNet autoencoders, in: ICML, 2017.
[3] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, A. Roberts, GANSynth: Adversarial neural audio synthesis, in: International Conference on Learning Representations, 2019.
[4] N. Mor, L. Wolf, A. Polyak, Y. Taigman, A universal music translation network, arXiv preprint arXiv:1805.07848 (2018).
[5] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, K. Kavukcuoglu, Efficient neural audio synthesis, arXiv preprint arXiv:1802.08435 (2018).
[6] L. H. Hantrakul, J. Engel, A. Roberts, C. Gu, Fast and flexible neural audio synthesis, in: ISMIR, 2019.
[7] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al., Parallel WaveNet: Fast high-fidelity speech synthesis, in: International Conference on Machine Learning, PMLR, 2018, pp. 3918–3926.
[8] J. Engel, L. H. Hantrakul, C. Gu, A. Roberts, DDSP: Differentiable digital signal processing, in: International Conference on Learning Representations, 2020.
[9] J. Engel, R. Swavely, L. H. Hantrakul, A. Roberts, C. Hawthorne, Self-supervised pitch detection by inverse audio synthesis, 2020.
[10] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, I. Sutskever, Jukebox: A generative model for music, arXiv preprint arXiv:2005.00341 (2020).
[11] X. Wang, S. Takaki, J. Yamagishi, Neural source-filter waveform models for statistical parametric speech synthesis, arXiv preprint arXiv:1904.12088 (2019).
[12] J. W. Kim, J. Salamon, P. Li, J. P. Bello, Crepe: A convolutional representation for pitch estimation, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 161–165.
[13] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).
[14] V. Nair, G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: ICML, 2010.
[15] X. Serra, J. Smith, Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition, Computer Music Journal 14 (1990) 12–24.
[16] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.
[17] G. AIUX Scouts, G. Magenta, Tonetransfer, https://sites.research.google/tonetransfer, 2020. Accessed: 2020-12-10.
[18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).