Learning Latent Representations of Music to Generate Interactive Musical Palettes

Adam Roberts (Google Brain, Mountain View, USA) adarob@google.com
Jesse Engel (Google Brain, Mountain View, USA) jesseengel@google.com
Sageev Oore* (Dalhousie University & Vector Institute, Canada) sageev@vectorinstitute.ai
Douglas Eck (Google Brain, Mountain View, USA) deck@google.com

ABSTRACT
Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of high-dimensional data such as music. By creating interactive latent space "palettes" of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface providing intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based, GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining potential for creative flexibility and personalization.

Author Keywords
musical interface; latent space; variational autoencoder; deep learning

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. MILC '18, March 11, 2018, Tokyo, Japan.
* Research conducted while this author was at Google Brain.

INTRODUCTION
Music, when treated as data, is often represented in high-dimensional spaces. Digital music formats such as CD-quality Pulse-code modulation (PCM), for example, record audible vibrations as a discrete sequence of 44,100 16-bit integers per second [11]; audio can then be modelled by treating each sample as a unique dimension and capturing correlations between them. Alternately, musical compositions can be communicated as a score; in one heavily constrained version of this, for example, we could represent sequences of monophonic 16th notes of equal intensity with approximately 7 bits for each note, or 112 per bar. That is far fewer dimensions than audio, but even there, exploring all possible variations of a score by flipping one bit at a time quickly becomes intractable, and further, would result in a large proportion of melodies being so musically unconventional that they would easily be perceived as incoherent.

While the high dimensionality affords an exponential number of possibilities, only some of these possibilities are likely for real music, which could be seen as residing on a lower-dimensional manifold within the space. Machine learning techniques can learn the shape of such low-dimensional manifolds from data, and be used to gain better understanding of, and to explore, large datasets [22]. They can also be used to build creative tools within the realm of "Artificial Intelligence Augmentation" (AIA) [7]. Learning the reduction directly from the data allows us to avoid heuristics and hand-tuned features, along with the biases and preconceptions about the data that would normally accompany those.

Autoencoders and variational autoencoders [13] are models designed to learn efficient, low-dimensional representations capable of reproducing their high-dimensional inputs. The hope is that in order to effectively utilize the "latent space", they will learn a mapping that is "effective". What do we mean by this, or rather, what might be some desirable characteristics for such a mapping? First, we might wish for smoothness: for example, if two points are near each other in latent space, then we would like their corresponding points in the output space to also be near one another. In the constrained case of monophonic melodies mentioned above, this would mean that we would like the two monophonic sequences to be perceptually similar. Second, while the original high-dimensional space allows for very unlikely points, we would like the latent space to correspond primarily to the likely ones: that is, if we map from a sampled point in the latent space to the original space, we would like the resulting point to be "feasible", i.e. not one of those unconventionally incoherent sequences that we described earlier. If we can satisfy these two requirements, then that would allow interpolation in the latent space to correspond to a meaningful interpolation in the original space. For example, if A and B are score representations of two monophonic melodies, and f(A) and f(B) are their corresponding latent representations, then as we sample N + 1 points along the line between the latent points f(A) and f(B):

    c_i = α_i f(A) + (1 − α_i) f(B)

where α_i = i/N and i runs from 0 ... N, then C_i = f⁻¹(c_i) should always be a feasible (i.e. statistically "likely") melody, and also C_i should be perceptually fairly similar to C_{i+1}. That is, the smoothness of the latent space with respect to the outputs makes it possible to define projections of the output space onto 1-D line segments and 2-D rectangles [14, 7]. Tracing along such low-dimensional manifolds, we can thus morph from one melody to another in an interesting way. Low-dimensional representations thus potentially afford interesting visualizations and natural interactions.

Figure 1. Diagram of an autoencoder/VAE. Input (in this case an audio waveform) is mapped through an encoder function to a compressed latent vector z. Transformations, such as interpolation, can then be applied to the vector. The resulting latent vector is then fed through a decoder function to produce the output audio.

In this paper, we present two interfaces based on the principle of musical latent spaces:

• A musical sequence explorer for 2-bar melody loops and drum beats, using the latent space of MusicVAE [20].

• An instrument timbre explorer, using the audio-based latent space of instrument samples generated by NSynth [8].

The former is implemented using deeplearn.js [21] for GPU-accelerated inference in the browser, enabling dynamic exploration with no additional installation or setup required. It can output both audio and MIDI. The latter is implemented as a Max For Live device with pre-synthesized samples, providing access to a massive number of timbres from within Ableton Live, a widely-used, professional-grade production tool.

RELATED WORK
There is a fair amount of prior work on the concept of "musical morphing" in the space of both audio and compositions. In the audio realm, techniques such as cross-synthesis use heuristics or machine learning to mix hand-designed features of instrument timbres [12]. For compositions, heuristics and models based on music theory are used to morph between existing pieces [5, 9].

MMorph [19] is described as a real-time tool that allows users to morph between up to four MIDI compositions by mixing different elements of each piece (such as rhythm, pitch, dynamics, timbre, harmony). Via a SmallTalk GUI, a user could specify which elements of each MIDI file to morph with checkboxes and drag a control around a 2-D space to determine the relative amounts contributed by each of the four pieces in real-time. The user could also choose from several different morphing methods such as "interpolation" and "weighting", although no description of these methods is provided.

Hamanaka et al. [9] introduce a simple copy-and-paste interface for both interpolating between and extrapolating beyond pairs of melodies by reducing the melodies to low-dimensional representations based on music theory.

Closely related to our melodic interpolation is work by Bretan et al. [4, 3]. The authors first extract a bag-of-features from monophonic musical phrases, and then use an autoencoder to construct a latent space. By selecting nearest-neighbors from the training set in latent space, the autoencoder can interpolate naturally between any two given 2-bar monophonic musical phrases, such that the gradual progression can be heard in both the harmonic and rhythmic elements [4]. They subsequently extend this to learn small perturbations in the latent space to allow a variety of real-time call-and-response interactions for polyphonic musical phrases [3]. However, the approach is somewhat less scalable and generalizable, as it can only recombine elements from the training data, and the nearest-neighbor selection scales poorly as the size of the training set grows.

The Hyperviolin [12] is a physical, violin-like instrument that makes use of cross-synthesis to produce timbres mixing between various instruments such as a flute or choral vocals. The developers demonstrated multiple methods of controlling the relative weights of the high-level timbral features via "sensor shoes". The performer could control some constrained aspects of the timbre within a preset space in real-time with movements of her foot, and could move between different presets throughout a piece by making larger physical movements along a path.

The Morph Table [6] uses a 3-D interface with physical cubes allowing multiple users to control the musical output of the system by morphing in both audio and compositional spaces. Movements of each cube in one dimension are associated with compositional morphing using techniques such as key-modulation, linear interpolation of note features (pitch, onset, duration, etc.), or "cross-fading" between subsections. In a second dimension, standard audio effects are adjusted. Movements in a third dimension, accessed by rotating the cubes to expose different faces, change the morphing endpoints between six different presets.

Existing examples of AIA interfaces using latent spaces include examples for generating faces [14], fonts [7], and generic images such as shoes and landscapes [23].

METHODS
Both of the interfaces introduced in this paper employ autoencoders, unsupervised machine learning models each composed of three parts: an encoder, a latent vector, and a decoder (Figure 1). The latent vector, z, is an intermediate representation of the data examples, x, but is either of lower dimension or is regularized to produce an information bottleneck. In the case of variational autoencoders (VAE), z is regularized to encourage the model to learn low-entropy latent representations of x close to a standard multivariate Gaussian distribution. The encoder is a neural network that produces z from the input x, and the decoder is a separate neural network that attempts to reconstruct x from z. Gradient-based optimization is then used to reduce the reconstruction loss (the difference between the input and the decoded output) and the KL divergence from the Gaussian prior (if applicable) [13].

The purpose of the bottleneck is that it forces the model to distill the information content of the data into lower-entropy representations. In the process, it discovers and encodes the most important features that differentiate data examples. The model becomes specialized to produce examples like those from the training distribution, making it difficult for the decoder to produce "unrealistic" outputs that are improbable under the distribution p(x). However, when trained properly it is general enough to be able to reconstruct and generate examples that are from the same distribution as p(x) but do not appear in the train set. The resulting latent space is therefore optimized for various forms of exploration, including interpolation within the latent space, which can "morph" between values in the output space, producing realistic intermediate outputs that combine features of the endpoints to varying degrees.

Figure 2. The MusicVAE Sequencer. On the left is an editable sequencer for modifying the corner melodies or rhythms, along with several controls including a toggle between melody and drum modes and a drop-down selector for MIDI output. On the right is the 2-D interpolation palette where each interior square contains a sequence generated by decoding an interpolated point from the latent space between the corner sequences. The interface is shown in "Draw" mode, where the white puck will follow along the user-drawn 1-D curve (in this case shaped like an "M"), morphing the sequence in real-time as it moves through the latent space.

MusicVAE
MusicVAE is a VAE for musical sequences, such as drum beats and melodies. It uses LSTM [10] recurrent neural networks as its encoder and decoder. For our purposes, we focus on models learned from either 2-bar drum beats or 2-bar melody loops. Training sets were obtained by scraping the web for MIDI files and extracting all unique 2-bar sequences of the two types, resulting in 28 million melodies and 3.8 million drum beats.

To produce good reconstructions (and interpolations), we train a VAE with a trade-off parameter that assigns a lower weight to the KL divergence, thus allowing the system to pass enough information through the bottleneck to be able to reproduce nearly any realistic input sequence in our dataset. However, models trained in this way on noisy datasets such as ours often do not produce realistic examples when sampling random points from the Gaussian prior. To produce better random samples, we train VAEs with the trade-off parameter set to a value that encourages a better KL divergence at the expense of reconstruction quality.

NSynth
With NSynth, we are able to learn latent spaces not just of musical sequences, but of audio waveforms themselves. Like MusicVAE, NSynth is an autoencoder architecture (though not variational) that can learn to encode and decode sequences from a compressed latent space. Since the model must learn to most efficiently use the lower-entropy representation to represent waveforms, it learns salient acoustic features among the training data. However, the model architecture differs significantly from MusicVAE, as it must capture correlations over thousands of quantized waveform timesteps [8].

We train the NSynth model on the NSynth dataset, a collection of ~300,000 individual 4-second notes recorded from ~1,000 separate instruments and synthesizers [8]. By choosing to train on individual pitched notes, the model can then be used as a "neural synthesizer" to play back notes when receiving an input MIDI signal.

Since the model learns an expressive code for raw audio, it can be used to interpolate in this space and discover new instrument timbres that exist between pre-existing instruments. However, while the latent space is of much lower dimension than the original audio, there is no prior to sample from as was the case for MusicVAE. Therefore, there is no trivial way to randomly sample novel musical notes from the distribution, and we are limited to exploring subspaces anchored on known notes.

INTERFACES

MusicVAE Sequencer
The MusicVAE Sequencer (Figure 2) is an interface to the latent spaces of 2-bar drum loops and monophonic melodies, as described above. Users can toggle between these two spaces and define the four corner sequences of a 2-D, 11x11 grid by randomly sampling from the latent space (using the low-KL model), selecting from a predefined list, or inputting them manually into the sequencer at the resolution of 16th notes. Once the corner sequences are set, they are passed through the encoder of the VAE to determine their latent vectors. The latent vectors are then mixed using bi-linear interpolation in the 11x11 space, and the resulting values are decoded into sequences (from the high-KL model). With our implementation of the model architecture in deeplearn.js using weights learned via the original TensorFlow [2] model, the encoding, interpolation, and decoding can all be executed with sub-second latency on a typical consumer laptop.

Once the palette is filled, the user can drag the white "puck" around to hear the drum beat or melody loop in each square of the grid. When the puck moves to a new square, the sequencer immediately updates, but the play-head does not reset to the beginning, allowing the music to morph immediately but smoothly. The sound is played via the browser using preset audio samples with Tone.js [15] and can optionally be output as MIDI events via Web MIDI [1] to be further manipulated and synthesized externally.

The corner sequences can be changed at any time by the same methods mentioned above, or by dragging one of the interior squares to a corner of the grid, which will set it as the new corner and cause the interior points to be updated.

If using the sequencer as a composition tool, the user can record the MIDI or audio externally, or she can right-click on any square to download a MIDI file containing that particular sequence to use as part of an arrangement.

In the case of improvisation or live performance, the user may also use the "draw" tool to define a 1-D curve within the palette, which the puck will then move through at a rate the user controls. As the puck moves into each new square, the playback sequence is updated, just as if the user had moved the puck there by hand. In this manner, a performer can set up paths in both the drum and melody palettes that will continuously evolve at potentially different rates, introducing musically interesting phasing effects.

NSynth Instrument
The NSynth model is extremely computationally expensive (~30 minutes of GPU synthesis to generate four seconds of audio) and therefore presented a distinct challenge in creating an interactive experience for interpolating within the latent space in real-time on a laptop. Rather than generating sounds on demand, we curated a set of original sounds ahead of time and synthesized all of their interpolated latent representations. To produce even finer-grained resolution during playback, transitions are smoothed out by additionally mixing the audio in real-time from the nearest neighbor sounds on the grid. This is a straightforward case of trading off computation for memory.

Figure 3. The NSynth Instrument alongside other Ableton devices that it can be used in conjunction with to synthesize incoming MIDI events.

We created a playable instrument integrated into Ableton Live as a Max For Live device (Figure 3). We positioned real instrument timbres at the corners of a square grid, allowing the user to mix between all four using the latent space as a palette. We created further "multigrid" modes, tiling many four-instrument grids side by side. This results in a 7x7 grid of 4x4 grids, enabling a user to explore up to 64 different instruments by dragging across a single x-y pad. Given an x-y point, the device finds the nearest four interpolated samples and plays them back with volume inversely proportional to their x-y distance from the point. Users can choose from 5 pre-generated grids and 3 multigrids, or produce their own using the TensorFlow code released with the NSynth model [16].

Incoming MIDI notes are played back in the appropriate timbre based on the position on the palette. Additional controls include the ability to adjust where in each sample the playback should start, the playback speed, and settings for an Attack Decay Sustain Release (ADSR) envelope. Furthermore, the NSynth Instrument can be combined with other Live devices to add additional effects such as an arpeggiator.

Finally, users are able to map all of these settings to an external MIDI controller for ease of use in a studio or live setting.

EXPLORING THE INTERFACE
Interfaces provide mappings from the user's actions to the output space. In the case of the MusicVAE Sequencer, the input space consists of four user-defined basis points (the corners) combined with a 2-D control surface (the palette), and the output space consists of note or drum sequences. Generally, choosing the mapping is a crucial element of designing a user interface, with numerous trade-offs at play, and where the goals often include carefully and deliberately shaping the user's cognitive model of the system [17]. In our case, the mapping is learned with a VAE, and this therefore raises the question of how the user will understand and interact with it: from the user's perspective, how does this mapping work? [18] describes some of the steps that a user may take when learning continuous mappings for complex controls over high-dimensional output spaces; we follow some of those basic approaches to explore the MusicVAE Sequencer interface. In particular, we show how to start isolating and exaggerating aspects of the mapping.

One example of exaggeration is that we set three of the basis points (bottom-left, top-left, and top-right) to all be empty, and set the bottom-right to be entirely full (i.e. every percussion instrument playing at every 16th note). In that case, our interpolation lets us check what the system does as it goes from completely sparse (silence) to completely dense, as shown in Figure 4.

As we interpolate from empty to full, we notice that the system tends to start by populating the sequencer with percussion hits on the actual beats, then on the 8th notes, and then on the 16th notes, i.e. it progresses from more likely to less likely beats. It also progresses from sparser beat sequences to denser ones. This satisfies the requirements outlined in the introduction as follows: (1) smoothness is preserved, in that each sample along the interpolated latent path corresponds to beat sequences that are similar to (i.e. a little more or less dense than) those corresponding to its adjacent samples, and (2) by first adding beats on quarter notes, then 8th notes, etc., this suggests that the system is maintaining feasible (i.e. more likely) outputs.¹

¹ For example, one could imagine a sequence that goes from sparse to dense, but in which the new percussion hits are added by simply choosing random cells in the grid from a uniform distribution over the grid. This could still be a smooth mapping, but it would have the problem that the intermediate beat sequences, during interpolation, would generally be very unlikely.

Figure 4. Interpolating between empty and full. The dark grid on the left represents the piano roll of the percussion, and the multi-coloured square grid on the right (the palette) represents the (2-dimensional) latent space. The white puck represents a point in the latent space, and the piano roll shows that point's corresponding percussion sequence. For all six sub-figures here, the 4 corners of the latent space grid are fixed: the bottom-right corner has all drum hits enabled, and the other three corners are empty. For example, in the sub-figure at (Row 2, Col 1), the puck is just next to the bottom-right corner, and indeed, the corresponding piano roll shows a very dense sequence of percussion hits. Conversely, in (Row 1, Col 2), the puck is approximately near the centre, and we see a much sparser pattern. As the puck is moved from an empty corner toward the bottom-right, drum hits are added first on quarter notes, then eighth notes, and finally 16th notes, until the sequencer eventually fills as it nears the corner. The final two images on the second row (Row 2, Col 2) and (Row 2, Col 3) illustrate the expected symmetry with respect to the diagonal for this configuration (i.e. the percussion rolls are essentially identical for these two points on either side of the (top-left, bottom-right) diagonal).

Figure 5 shows another simple experiment in which we try isolating the location (phase) of the beat in the bar. We see that rather than interpolating the phase (which would gradually shift the offset), there is a tendency to superpose the two rhythmic figures. This, too, is a reasonable rhythmic approach. We note that doing so in melody space would not necessarily be as natural a choice (and in fact the system would not have that option, since the melodies are constrained to be monophonic).

Figure 5. Interpolation between playing "on" beats and playing "off" beats shows a moment of superposition, which we have observed with some other basic rhythmic patterns as well, and which has some natural interpretations.

While there is usually a lot more sophistication occurring in the interpolations aside from superposition, these results for simple cases have a natural interpretation for the user, providing landmarks on which to ground their understanding and cognitive models of the system.

In Figure 6, we explore interpolation in melody space by providing several different scales as the four basis points. As we move from top-left to top-right, we notice that at each step the output sequence is similar to the previous step, but that the combination of these smooth steps ultimately results in a very large change. Interestingly, at around a third of the way across, the output sequence includes some descending notes, even though both scales ascend. This shows that the system will sometimes move away from the basis points during the interpolation.

Figure 6. Here we interpolate in melody space between four different scale patterns.

Finally, we note that some parameters, such as the number of grid points, were chosen in order to make the system functional, with the focus on learning an effective latent space mapping; in future, these are design elements that would be worthwhile exploring more systematically and with user testing.

AVAILABILITY
Supplementary resources, including open-source code, are available for both the NSynth and MusicVAE interfaces in corresponding sub-directories of Magenta's demo GitHub repository.²

² https://github.com/tensorflow/magenta-demos/

CONCLUSION
This work demonstrates the use of machine learning as the basis for creative musical tools. Machine learning is often seen as a method for outsourcing mundane discriminative tasks, resulting in the desire for rigid systems that perform their duties with high accuracy. While it is likely that such systems could be used to produce satisfying music for listeners, our work indicates directions for how we might use machine learning (specifically, latent-space representations) to create UI mappings that we hope will eventually provide music creators with new and interesting creativity-supporting tools.

ACKNOWLEDGMENTS
We thank Torin Blankensmith and Kyle Phillips from Google Creative Lab for implementing the MusicVAE Sequencer UI. We thank Nikhil Thorat for assistance with the deeplearn.js implementation. We thank the members of the Magenta team for helpful discussions and observations.

REFERENCES
1. 2015. Web MIDI API. http://webaudio.github.io/web-midi-api/. (2015). Accessed: 2017-12-21.
2. Martín Abadi and others. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. White Paper. Google. http://download.tensorflow.org/paper/whitepaper2015.pdf
3. Mason Bretan, Sageev Oore, Jesse Engel, Douglas Eck, and Larry Heck. 2017a. Deep Music: Towards Musical Dialogue. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17).
4. Mason Bretan, Gil Weinberg, and Larry Heck. 2017b. A Unit Selection Methodology for Music Generation Using Deep Neural Networks. In International Conference on Computational Creativity (ICCC 2017).
5. Andrew Brown and Rene Wooller. 2005. Investigating Morphing Algorithms for Generative Music. In Proceedings of Third Iteration, T Innocent (Ed.). Centre for Electronic Media Art, Melbourne, Victoria, Australia, 189–198. https://eprints.qut.edu.au/24696/
6. Andrew Brown, René Wooller, and Kate Thomas. 2007. The Morph Table: A collaborative interface for musical interaction. In Australasian Computer Music Conference (ACMC). 34–39.
7. Shan Carter and Michael Nielsen. 2017. Using Artificial Intelligence to Augment Human Intelligence. Distill (2017). http://distill.pub/2017/aia
8. Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. 2017. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. In International Conference on Machine Learning (ICML). https://arxiv.org/abs/1704.01279
9. Masatoshi Hamanaka, Keiji Hirata, and Satoshi Tojo. 2009. Melody morphing method based on GTTM. In International Computer Music Conference (ICMC). 89–92. http://hdl.handle.net/2027/spo.bbp2372.2009.020
10. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. DOI: http://dx.doi.org/10.1162/neco.1997.9.8.1735
11. IEC 60908 1987. Audio recording - Compact disc digital audio system. Standard. International Electrotechnical Commission (IEC).
12. Tristan Jehan. 2001. Perceptual synthesis engine: an audio-driven timbre generator. Master's thesis. Massachusetts Institute of Technology (MIT). http://hdl.handle.net/1721.1/61543
13. Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR). http://arxiv.org/abs/1312.6114
14. Ian Loh and Tom White. 2017. TopoSketch: Drawing in Latent Space. In NIPS Workshop on Machine Learning for Creativity and Design. https://nips2017creativity.github.io/doc/TopoSketch.pdf
15. Yotam Mann. 2015. Interactive Music with Tone.js. In Web Audio Conference (WAC). http://wac.ircam.fr/pdf/wac15_submission_40.pdf
16. Parag Mital. 2017. Generate your own sounds with NSynth. https://magenta.tensorflow.org/nsynth-fastgen. (2017). Accessed: 2017-12-21.
17. Donald A. Norman. 2002. The Design of Everyday Things. Basic Books, Inc., New York, NY, USA.
18. Sageev Oore. 2005. Learning Advanced Skills on New Instruments. In International Conference on New Interfaces for Musical Expression (NIME). Vancouver, Canada.
19. Daniel V. Oppenheim. 1995. Demonstrating MMorph: A System for Morphing Music in Real-time. In International Computer Music Conference (ICMC). 479–480.
20. Adam Roberts, Jesse Engel, and Douglas Eck. 2017. Hierarchical Variational Autoencoders for Music. In NIPS Workshop on Machine Learning for Creativity and Design. https://nips2017creativity.github.io/doc/Hierarchical_Variational_Autoencoders_for_Music.pdf
21. Daniel Smilkov, Nikhil Thorat, and Charles Nicholson. 2017. deeplearn.js: a hardware-accelerated machine intelligence library for the web. https://deeplearnjs.org/. (2017). Accessed: 2017-12-21.
22. Laurens van der Maaten, Eric Postma, and Jaap van den Herik. 2009. Dimensionality reduction: A comparative review. Technical Report. TiCC, Tilburg University.
23. Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. 2016. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision (ECCV). https://arxiv.org/abs/1609.03552
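The introduction's estimate of roughly 7 bits per note (112 per bar) can be checked with quick arithmetic; this is a minimal sketch (not from the paper) assuming each 16th-note step holds one of 128 MIDI pitch values:

```python
import math

PITCH_CLASSES = 128   # assumption: the standard MIDI pitch range
STEPS_PER_BAR = 16    # 16th-note resolution in a 4/4 bar

bits_per_step = math.ceil(math.log2(PITCH_CLASSES))
bits_per_bar = bits_per_step * STEPS_PER_BAR

print(bits_per_step)  # 7
print(bits_per_bar)   # 112

# Even at this coarse encoding, the number of distinct 2-bar sequences
# (2 ** 224) is astronomically large, which is why bit-flipping
# exploration is intractable.
num_two_bar_sequences = 2 ** (bits_per_bar * 2)
```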
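The linear interpolation between latent codes described above can be sketched in a few lines. This is an illustrative stand-in, not the paper's implementation: the toy vectors `z_a` and `z_b` play the roles of f(A) and f(B), and a real decoder would map each point back to a sequence:

```python
import numpy as np

def interpolate_latents(z_a, z_b, n):
    """Sample n + 1 evenly spaced points on the segment between two latent
    codes, mirroring c_i = alpha_i * f(A) + (1 - alpha_i) * f(B) with
    alpha_i = i / n."""
    alphas = np.linspace(0.0, 1.0, n + 1)
    return [alpha * z_a + (1.0 - alpha) * z_b for alpha in alphas]

# Toy 4-dimensional latent codes standing in for f(A) and f(B).
z_a = np.array([1.0, 0.0, 0.0, 0.0])
z_b = np.array([0.0, 1.0, 0.0, 0.0])
path = interpolate_latents(z_a, z_b, n=4)
# path[0] is z_b (alpha = 0), path[-1] is z_a (alpha = 1),
# and the midpoint mixes the two equally.
```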
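The VAE objective described in this section (reconstruction loss plus KL divergence from the Gaussian prior) can be sketched as follows. This is a toy illustration, not the models' actual training code: squared error stands in for the reconstruction term, and `beta` is a hypothetical name for the KL trade-off weight the text mentions:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Toy VAE objective: reconstruction error plus beta-weighted KL term.

    Uses the closed-form KL divergence between the encoder's diagonal
    Gaussian N(mu, exp(logvar)) and the standard Gaussian prior N(0, I).
    """
    recon = np.sum((x - x_recon) ** 2)  # stand-in reconstruction loss
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl

# When reconstruction is perfect and the posterior equals the prior
# (mu = 0, logvar = 0), both terms vanish.
x = np.ones(8)
loss = vae_loss(x, x, mu=np.zeros(4), logvar=np.zeros(4), beta=0.5)
```

Lowering `beta` lets more information through the bottleneck (better reconstructions); raising it pulls the posterior toward the prior (better random samples), which is the trade-off the MusicVAE section describes.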