Learning Latent Representations of Music to Generate Interactive Musical Palettes

Adam Roberts, Google Brain, Mountain View, USA, adarob@google.com
Jesse Engel, Google Brain, Mountain View, USA, jesseengel@google.com
Sageev Oore∗, Dalhousie University & Vector Institute, Canada, sageev@vectorinstitute.ai
Douglas Eck, Google Brain, Mountain View, USA, deck@google.com

∗ Research conducted while this author was at Google Brain.

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. MILC ’18, March 11, 2018, Tokyo, Japan.


ABSTRACT
Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of high-dimensional data such as music. By creating interactive latent space “palettes” of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface to intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based, GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining potential for creative flexibility and personalization.

Author Keywords
musical interface; latent space; variational autoencoder; deep learning

INTRODUCTION
Music, when treated as data, is often represented in high-dimensional spaces. Digital music formats such as CD-quality Pulse-code modulation (PCM), for example, record audible vibrations as a discrete sequence of 44.1 thousand 16-bit integers per second [11]; audio can then be modelled by treating each sample as a unique dimension and capturing correlations between them. Alternately, musical compositions can be communicated as a score; in one heavily constrained version of this, for example, we could represent sequences of monophonic 16th notes of equal intensity with approximately 7 bits for each note, or 112 bits per bar. That is far fewer dimensions than audio, but even there, exploring all possible variations of a score by flipping one bit at a time quickly becomes intractable; further, it would result in a large proportion of melodies being so musically unconventional that they would easily be perceived as incoherent.

While the high dimensionality affords an exponential number of possibilities, only some of these possibilities are likely for real music, which could be seen as residing on a lower-dimensional manifold within the space. Machine learning techniques can learn the shape of such low-dimensional manifolds from data, and can be used to gain a better understanding of, and to explore, large datasets [22]. They can also be used to build creative tools within the realm of “Artificial Intelligence Augmentation” (AIA) [7]. Learning the reduction directly from the data allows us to avoid heuristics and hand-tuned features, along with the biases and preconceptions about the data that would normally accompany them.

Autoencoders and variational autoencoders [13] are models designed to learn efficient, low-dimensional representations capable of reproducing their high-dimensional inputs. The hope is that, in order to effectively utilize the “latent space”, they will learn a mapping that is “effective”. What do we mean by this, or rather, what might be some desirable characteristics of such a mapping? First, we might wish for smoothness: if two points are near each other in latent space, then we would like their corresponding points in the output space to also be near one another. In the constrained case of monophonic melodies mentioned above, this would mean that we would like the two monophonic sequences to be perceptually similar. Second, while the original high-dimensional space allows for very unlikely points, we would like the latent space to correspond primarily to the likely ones: that is, if we map from a sampled point in the latent space to the original space, we would like the resulting point to be “feasible”, i.e. not one of the unconventionally incoherent sequences described earlier.

If we can satisfy these two requirements, then interpolation in the latent space corresponds to a meaningful interpolation in the original space. For example, if A and B are score representations of two monophonic melodies, and f(A) and f(B) are their corresponding latent representations, then, as we sample N + 1 points along the line between the latent points f(A) and f(B),

    c_i = α_i f(A) + (1 − α_i) f(B),   where α_i = i/N and i runs from 0 to N,

each decoded point C_i = f⁻¹(c_i) should always be a feasible (i.e. statistically “likely”) melody, and C_i should also be perceptually fairly similar to C_{i+1}. That is, the smoothness of the latent space with respect to the outputs makes it possible to define projections of the output space onto 1-D line segments and 2-D rectangles [14, 7]. Tracing along such low-dimensional manifolds, we can thus morph from one melody to another in an interesting way. Low-dimensional representations thus potentially afford interesting visualizations and natural interactions.
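As a concrete illustration of this interpolation, the sketch below decodes points along the line between two latent codes. The `encode` and `decode` callables are hypothetical stand-ins for f and its (approximate) inverse, not the actual model API.

```python
import numpy as np

def interpolate_melodies(melody_a, melody_b, encode, decode, n=10):
    """Decode N + 1 points on the line between f(A) and f(B):
    c_i = alpha_i * f(A) + (1 - alpha_i) * f(B), with alpha_i = i / N."""
    z_a = encode(melody_a)   # f(A): latent vector for melody A
    z_b = encode(melody_b)   # f(B): latent vector for melody B
    outputs = []
    for alpha in np.linspace(0.0, 1.0, n + 1):
        c = alpha * z_a + (1.0 - alpha) * z_b
        outputs.append(decode(c))  # C_i = f^-1(c_i); ideally a feasible melody
    return outputs
```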
In this paper, we present two interfaces based on the principle of musical latent spaces:

• A musical sequence explorer for 2-bar melody loops and drum beats, using the latent space of MusicVAE [20].

• An instrument timbre explorer, using the audio-based latent space of instrument samples generated by NSynth [8].

The former is implemented using deeplearn.js [21] for GPU-accelerated inference in the browser, enabling dynamic exploration with no additional installation or setup required. It can output both audio and MIDI. The latter is implemented as a Max For Live device with pre-synthesized samples, providing access to a massive number of timbres from within Ableton Live, a widely used, professional-grade production tool.

RELATED WORK
There is a fair amount of prior work on the concept of “musical morphing” in the space of both audio and compositions. In the audio realm, techniques such as cross-synthesis use heuristics or machine learning to mix hand-designed features of instrument timbres [12]. For compositions, heuristics and models based on music theory are used to morph between existing pieces [5, 9].

MMorph [19] is described as a real-time tool that allows users to morph between up to four MIDI compositions by mixing different elements of each piece (such as rhythm, pitch, dynamics, timbre, and harmony). Via a Smalltalk GUI, a user could specify with checkboxes which elements of each MIDI file to morph, and drag a control around a 2-D space to determine the relative amounts contributed by each of the four pieces in real time. The user could also choose from several different morphing methods such as “interpolation” and “weighting”, although no description of these methods is provided.

Hamanaka et al. [9] introduce a simple copy-and-paste interface for both interpolating between and extrapolating beyond pairs of melodies by reducing the melodies to low-dimensional representations based on music theory.

Closely related to our melodic interpolation is the work of Bretan et al. [4, 3]. The authors first extract a bag of features from monophonic musical phrases and then use an autoencoder to construct a latent space. By selecting nearest neighbors from the training set in latent space, the autoencoder can interpolate naturally between any two given 2-bar monophonic musical phrases, such that the gradual progression can be heard in both the harmonic and rhythmic elements [4]. They subsequently extend this to learn small perturbations in the latent space to allow a variety of real-time call-and-response interactions for polyphonic musical phrases [3]. However, the approach is somewhat less scalable and generalizable, as it can only recombine elements from the training data, and the nearest-neighbor selection scales poorly as the size of the training set grows.

The Hyperviolin [12] is a physical, violin-like instrument that makes use of cross-synthesis to produce timbres mixing between various instruments such as a flute or choral vocals. The developers demonstrated multiple methods of controlling the relative weights of the high-level timbral features via “sensor shoes”. The performer could control some constrained aspects of the timbre within a preset space in real time with movements of her foot and could move between different presets throughout a piece by making larger physical movements along a path.

The Morph Table [6] uses a 3-D interface with physical cubes, allowing multiple users to control the musical output of the system by morphing in both audio and compositional spaces. Movements of each cube in one dimension are associated with compositional morphing using techniques such as key modulation, linear interpolation of note features (pitch, onset, duration, etc.), or “cross-fading” between subsections. In a second dimension, standard audio effects are adjusted. Movements in a third dimension, accessed by rotating the cubes to expose different faces, change the morphing endpoints between six different presets.

Existing examples of AIA interfaces using latent spaces include examples for generating faces [14], fonts [7], and generic images such as shoes and landscapes [23].

METHODS
Figure 1. Diagram of an autoencoder/VAE. Input (in this case an audio waveform) is mapped through an encoder function to a compressed latent vector z. Transformations, such as interpolation, can then be applied to the vector. The resulting latent vector is then fed through a decoder function to produce the output audio.

Both of the interfaces introduced in this paper employ autoencoders, unsupervised machine learning models each composed of three parts: an encoder, a latent vector, and a decoder (Figure 1). The latent vector, z, is an intermediate representation of the data examples, x, but is either of lower dimension or is regularized to produce an information bottleneck. In the case of variational autoencoders (VAEs), z is regularized to encourage the model to learn low-entropy latent representations of x close to a standard multivariate Gaussian distribution. The encoder is a neural network that produces z from the input x, and the decoder is a separate neural network that attempts to reconstruct x from z. Gradient-based optimization is then used to reduce the reconstruction loss (the difference between the input and its decoded reconstruction) and the KL divergence from the Gaussian prior (if applicable) [13].
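The following sketch summarizes this objective. The encoder and decoder networks are abstract callables (hypothetical placeholders, not the actual MusicVAE or NSynth code), and the KL term is the closed-form divergence between a diagonal Gaussian posterior and the standard normal prior [13].

```python
import numpy as np

def vae_loss(x, encoder_net, decoder_net, kl_weight=1.0):
    """Reconstruction error plus (optionally weighted) KL divergence
    from the standard Gaussian prior, in the spirit of [13]."""
    mu, log_var = encoder_net(x)                 # encoder predicts q(z|x)
    eps = np.random.standard_normal(mu.shape)    # reparameterization trick
    z = mu + np.exp(0.5 * log_var) * eps         # sample latent vector z
    x_hat = decoder_net(z)                       # decoder reconstructs x from z

    reconstruction = np.mean((x - x_hat) ** 2)   # squared error here; task-dependent in practice
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return reconstruction + kl_weight * kl
```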
The purpose of the bottleneck is that it forces the model to distill the information content of the data to lower-entropy representations. In the process, it discovers and encodes the most important features that differentiate data examples. The model becomes specialized to produce examples like those from the training distribution, making it difficult for the decoder to produce “unrealistic” outputs that are improbable under the distribution p(x). However, when trained properly, it is general enough to be able to reconstruct and generate examples that are from the same distribution as p(x) but do not appear in the training set. The resulting latent space is therefore optimized for various forms of exploration, including interpolation within the latent space, which can “morph” between values in the output space, producing realistic intermediate outputs that combine features of the endpoints to varying degrees.

MusicVAE
MusicVAE is a VAE for musical sequences, such as drum beats and melodies. It uses LSTM [10] recurrent neural networks as its encoder and decoder. For our purposes, we focus on models learned from either 2-bar drum beats or 2-bar melody loops. Training sets were obtained by scraping the web for MIDI files and extracting all unique 2-bar sequences of the two types, resulting in 28 million melodies and 3.8 million drum beats.

To produce good reconstructions (and interpolations), we train a VAE with a trade-off parameter that assigns a lower weight to the KL divergence, thus allowing the system to pass enough information through the bottleneck to be able to reproduce nearly any realistic input sequence in our dataset. However, models trained in this way on noisy datasets such as ours often do not produce realistic examples when sampling random points from the Gaussian prior. To produce better random samples, we train VAEs with the trade-off parameter set to a value that encourages a better KL divergence at the expense of reconstruction quality.
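To see why the second configuration matters for random generation, a minimal sketch of prior sampling is shown below. The `decode` callable and `latent_dim` are illustrative placeholders; drawing z from a standard Gaussian only yields realistic output when training has pushed the latent distribution close to that prior.

```python
import numpy as np

def sample_random_corner(decode, latent_dim, rng=None):
    """Decode a random draw from the standard Gaussian prior into a sequence.
    This is only reliable for the model trained with a stronger KL penalty
    (the "low-KL" model referred to below); `decode` is a placeholder."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(latent_dim)   # sample z ~ N(0, I)
    return decode(z)                      # should be a realistic melody or beat
```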
NSynth
With NSynth, we are able to learn latent spaces not just of musical sequences, but of audio waveforms themselves. Like MusicVAE, NSynth is an autoencoder architecture (though not variational) that can learn to encode and decode sequences from a compressed latent space. Since the model must learn to use the lower-entropy representation as efficiently as possible to represent waveforms, it learns salient acoustic features of the training data. However, the model architecture differs significantly from MusicVAE’s, as it must capture correlations over thousands of quantized waveform timesteps [8].

We train the NSynth model on the NSynth dataset, a collection of ~300,000 individual 4-second notes recorded from ~1,000 separate instruments and synthesizers [8]. By choosing to train on individual pitched notes, the model can then be used as a “neural synthesizer” to play back notes when receiving an input MIDI signal.

Since the model learns an expressive code for raw audio, it can be used to interpolate in this space and discover new instrument timbres that exist between pre-existing instruments. However, while the latent space is of much lower dimension than the original audio, there is no prior to sample from as was the case for MusicVAE. Therefore, there is no trivial way to randomly sample novel musical notes from the distribution, and we are limited to exploring subspaces anchored on known notes.

INTERFACES

MusicVAE Sequencer
Figure 2. The MusicVAE Sequencer. On the left is an editable sequencer for modifying the corner melodies or rhythms, along with several controls including a toggle between melody and drum modes and a drop-down selector for MIDI output. On the right is the 2-D interpolation palette, where each interior square contains a sequence generated by decoding an interpolated point from the latent space between the corner sequences. The interface is shown in “Draw” mode, where the white puck will follow along the user-drawn 1-D curve (in this case shaped like an “M”), morphing the sequence in real-time as it moves through the latent space.

The MusicVAE Sequencer (Figure 2) is an interface to the latent spaces of 2-bar drum loops and monophonic melodies, as described above. Users can toggle between these two spaces and define the four corner sequences of a 2-D, 11x11 grid by randomly sampling from the latent space (using the low-KL model), selecting from a predefined list, or inputting them manually into the sequencer at the resolution of 16th notes. Once the corner sequences are set, they are passed through the encoder of the VAE to determine their latent vectors. The latent vectors are then mixed using bilinear interpolation in the 11x11 space, and the resulting values are decoded into sequences (from the high-KL model). With our implementation of the model architecture in deeplearn.js, using weights learned via the original TensorFlow [2] model, the encoding, interpolation, and decoding can all be executed with sub-second latency on a typical consumer laptop.
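As a sketch of this bilinear mixing step (with placeholder function and variable names; the actual implementation runs in deeplearn.js in the browser), each palette cell’s latent vector could be computed as follows before decoding:

```python
import numpy as np

def palette_latents(z_tl, z_tr, z_bl, z_br, size=11):
    """Bilinearly interpolate four corner latent vectors over a size x size grid.
    Row 0 is the top of the palette; column 0 is the left."""
    grid = np.empty((size, size, z_tl.shape[0]))
    for row in range(size):
        v = row / (size - 1)                     # 0 at the top, 1 at the bottom
        for col in range(size):
            u = col / (size - 1)                 # 0 at the left, 1 at the right
            top = (1 - u) * z_tl + u * z_tr      # interpolate along the top edge
            bottom = (1 - u) * z_bl + u * z_br   # interpolate along the bottom edge
            grid[row, col] = (1 - v) * top + v * bottom
    return grid  # each grid[row, col] is then decoded into a sequence
```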
Now that the palette is filled, the user can drag the white “puck” around to hear the drum beat or melody loop in each square of the grid. When the puck moves to a new square, the sequencer immediately updates, but the play-head does not reset to the beginning, allowing the music to morph immediately but smoothly. The sound is played via the browser using preset audio samples with Tone.js [15] and can optionally be output as MIDI events via Web MIDI [1] to be further manipulated and synthesized externally.

The corner sequences can be changed at any time by the same methods mentioned above or by dragging one of the interior squares to a corner of the grid, which will set it as the new corner and cause the interior points to be updated.

If using the sequencer as a composition tool, the user can record the MIDI or audio externally, or she can right-click on any square to download a MIDI file containing that particular sequence to use as part of an arrangement.

In the case of improvisation or live performance, the user may also use the “draw” tool to define a 1-D curve within the palette, which the puck will then move through at a rate the user controls. As the puck moves into each new square, the playback sequence is updated, just as if the user had moved the puck there by hand. In this manner, a performer can set up paths in both the drum and melody palettes that will continuously evolve at potentially different rates, introducing musically interesting phasing effects.
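A minimal sketch of this draw-tool behaviour follows, under the assumption that the drawn curve has already been rasterized into an ordered list of palette cells; `set_active_cell` is a hypothetical callback standing in for the sequencer update, not part of the actual UI code.

```python
import time

def follow_drawn_path(path_cells, set_active_cell, steps_per_second=2.0, loops=1):
    """Move the puck along a user-drawn 1-D curve through the palette.
    path_cells: ordered (col, row) grid cells along the drawn curve.
    set_active_cell: swaps in the decoded sequence for a cell without
    resetting the play-head."""
    for _ in range(loops):
        for col, row in path_cells:
            set_active_cell(col, row)            # morph playback to this cell's sequence
            time.sleep(1.0 / steps_per_second)   # rate is controlled by the user
```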
NSynth Instrument
The NSynth model is extremely computationally expensive (~30 minutes of GPU synthesis to generate four seconds of audio) and therefore presented a distinct challenge in creating an interactive experience for interpolating within the latent space in real time on a laptop. Rather than generating sounds on demand, we curated a set of original sounds ahead of time and synthesized all of their interpolated latent representations. To produce even finer-grained resolution during playback, transitions are smoothed out by additionally mixing the audio in real time from the nearest-neighbor sounds on the grid. This is a straightforward case of trading off computation for memory.

We created a playable instrument integrated into Ableton Live as a Max For Live device (Figure 3). We positioned real instrument timbres at the corners of a square grid, allowing the user to mix between all four using the latent space as a palette. We created further “multigrid” modes, tiling many four-instrument grids side by side. This results in a 7x7 grid of 4x4 grids, enabling a user to explore up to 64 different instruments by dragging across a single x-y pad. Given an x-y point, the device finds the nearest four interpolated samples and plays them back with volume proportional to their x-y distance from the point. Users can choose from 5 pre-generated grids and 3 multigrids or produce their own using the TensorFlow code released with the NSynth model [16].

Figure 3. The NSynth Instrument alongside other Ableton devices with which it can be combined to synthesize incoming MIDI events.

Incoming MIDI notes are played back in the appropriate timbre based on the position on the palette. Additional controls include the ability to adjust where in each sample playback should start, the playback speed, and settings for an Attack Decay Sustain Release (ADSR) envelope. Furthermore, the NSynth Instrument can be combined with other Live devices to add additional effects such as an arpeggiator.

Finally, users are able to map all of these settings to an external MIDI controller for ease of use in a studio or live setting.
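A rough sketch of the nearest-neighbor playback mixing described above is given below. Inverse-distance weighting is our assumption here (the text only states that volume depends on the x-y distance), and all names are placeholders rather than the actual Max For Live implementation.

```python
import numpy as np

def mix_nearest_samples(xy, sample_points, sample_audio, eps=1e-6):
    """Blend the four nearest pre-synthesized samples around an x-y point.
    sample_points: (N, 2) array of grid coordinates; sample_audio: (N, T) waveforms."""
    dists = np.linalg.norm(sample_points - np.asarray(xy), axis=1)
    nearest = np.argsort(dists)[:4]          # indices of the four closest samples
    weights = 1.0 / (dists[nearest] + eps)   # assumed: closer samples are louder
    weights /= weights.sum()                 # normalize the mix to sum to one
    return weights @ sample_audio[nearest]   # weighted mix, shape (T,)
```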

EXPLORING THE INTERFACE
Interfaces provide mappings from the user’s actions to the output space. In the case of the MusicVAE Sequencer, the input space consists of four user-defined basis points (the corners) combined with a 2-D control surface (the palette), and the output space consists of note or drum sequences. Generally, choosing the mapping is a crucial element of designing a user interface, with numerous trade-offs at play, and where the goals often include carefully and deliberately shaping the user’s cognitive model of the system [17]. In our case, the mapping is learned with a VAE, which raises the question of how the user will understand and interact with it: from the user’s perspective, how does this mapping work? Oore [18] describes some of the steps that a user may take when learning continuous mappings for complex controls over high-dimensional output spaces; we follow some of those basic approaches to explore the MusicVAE Sequencer interface. In particular, we show how to start isolating and exaggerating aspects of the mapping.

One example of exaggeration is that we set three of the basis points (bottom-left, top-left, and top-right) to all be empty, and set the bottom-right to be entirely full (i.e. every percussion instrument playing at every 16th note). In that case, our interpolation lets us check what the system does as it goes from completely sparse (silence) to completely dense, as shown in Figure 4.

Figure 4. Interpolating between empty and full. The dark grid on the left represents the piano roll of the percussion, and the multi-coloured square grid on the right (the palette) represents the (2-dimensional) latent space. The white puck represents a point in the latent space, and the piano roll shows that point’s corresponding percussion sequence. For all six sub-figures here, the four corners of the latent space grid are fixed: the bottom-right corner has all drum hits enabled, and the other three corners are empty. For example, in the sub-figure at (Row 2, Col 1), the puck is just next to the bottom-right corner, and indeed, the corresponding piano roll shows a very dense sequence of percussion hits. Conversely, in (Row 1, Col 2), the puck is approximately near the centre, and we see a much sparser pattern. As the puck is moved from an empty corner toward the bottom-right, drum hits are added first on quarter notes, then eighth notes, and finally 16th notes, until the sequencer eventually fills as it nears the corner. The final two images on the second row, (Row 2, Col 2) and (Row 2, Col 3), illustrate the expected symmetry with respect to the diagonal for this configuration (i.e. the percussion rolls are essentially identical for these two points on either side of the (top-left, bottom-right) diagonal).

As we interpolate from empty to full, we notice that the system tends to start by populating the sequencer with percussion hits on the actual beats, then on the 8th notes, and then on the 16th notes, i.e. it progresses from more likely to less likely beats. It also progresses from sparser beat sequences to denser ones. This satisfies the requirements outlined in the introduction as follows: (1) smoothness is preserved, in that each sample along the interpolated latent path corresponds to beat sequences that are similar to (i.e. a little more or less dense than) those corresponding to its adjacent samples, and (2) by first adding beats on quarter notes, then 8th notes, etc., this suggests that the system is maintaining feasible (i.e. more likely) outputs.¹

¹ For example, one could imagine a sequence that goes from sparse to dense, but in which the new percussion hits are added by simply choosing random cells in the grid from a uniform distribution over the grid. This could still be a smooth mapping, but it would have the problem that the intermediate beat sequences, during interpolation, would generally be very unlikely.
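To make the density observation concrete, one could decode points along the diagonal between the empty and full corners and count the active cells. The sketch below assumes a hypothetical `decode` function returning a binary instrument-by-16th-note matrix; it is an illustration of the check, not the code used for the figures.

```python
import numpy as np

def density_along_diagonal(z_empty, z_full, decode, steps=11):
    """Count active drum hits in sequences decoded along the line from the
    empty corner to the all-on corner; density should rise fairly smoothly."""
    densities = []
    for alpha in np.linspace(0.0, 1.0, steps):
        z = (1.0 - alpha) * z_empty + alpha * z_full
        roll = decode(z)              # assumed: binary matrix, instruments x 16th notes
        densities.append(int(np.sum(roll)))
    return densities                  # expected to increase toward the full corner
```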
Figure 5 shows another simple experiment, in which we try isolating the location (phase) of the beat in the bar. We see that rather than interpolating the phase (which would gradually shift the offset), there is a tendency to superpose the two rhythmic figures. This, too, is a reasonable rhythmic approach. We note that doing so in melody space would not necessarily be as natural a choice (and in fact the system would not have that option, since the melodies are constrained to be monophonic). While there is usually much more sophistication in the interpolations beyond superposition, these results for simple cases have a natural interpretation for the user, providing landmarks on which to ground their understanding and cognitive models of the system.

Figure 5. Interpolation between playing “on” beats and playing “off” beats shows a moment of superposition, which we have observed with some other basic rhythmic patterns as well, and which has some natural interpretations.

In Figure 6, we explore interpolation in melody space by providing several different scales as the four basis points. As we move from top-left to top-right, we notice that at each step the output sequence is similar to the previous step, but that the combination of these smooth steps ultimately results in a very large change. Interestingly, at around a third of the way across, the output sequence includes some descending notes, even though both scales ascend. This shows that the system will sometimes move away from the basis points during the interpolation.

Figure 6. Here we interpolate in melody space between four different scale patterns.

Finally, we note that some parameters, such as the number of grid points, were chosen in order to make the system functional, with the focus on learning an effective latent-space mapping; in the future, these are design elements that would be worth exploring more systematically and with user testing.

AVAILABILITY
Supplementary resources, including open-source code, are available for both the NSynth and MusicVAE interfaces in corresponding sub-directories of Magenta’s demo GitHub repository.²

² https://github.com/tensorflow/magenta-demos/

CONCLUSION
This work demonstrates the use of machine learning as the basis for creative musical tools. Machine learning is often seen as a method for outsourcing mundane discriminative tasks, resulting in the desire for rigid systems that perform their duties with high accuracy. While it is likely that such systems could be used to produce satisfying music for listeners, our work indicates directions for how we might use machine learning, specifically latent-space representations, to create UI mappings that we hope will eventually provide music creators with new and interesting creativity-supporting tools.

ACKNOWLEDGMENTS
We thank Torin Blankensmith and Kyle Phillips from Google Creative Lab for implementing the MusicVAE Sequencer UI. We thank Nikhil Thorat for assistance with the deeplearn.js implementation. We thank the members of the Magenta team for helpful discussions and observations.

REFERENCES
1. 2015. Web MIDI API. http://webaudio.github.io/web-midi-api/. (2015). Accessed: 2017-12-21.
2. Martín Abadi and others. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. White Paper. Google. http://download.tensorflow.org/paper/whitepaper2015.pdf
3. Mason Bretan, Sageev Oore, Jesse Engel, Douglas Eck, and Larry Heck. 2017a. Deep Music: Towards Musical Dialogue. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17).
4. Mason Bretan, Gil Weinberg, and Larry Heck. 2017b. A Unit Selection Methodology for Music Generation Using Deep Neural Networks. In International Conference on Computational Creativity (ICCC 2017).
5. Andrew Brown and Rene Wooller. 2005. Investigating Morphing Algorithms for Generative Music. In Proceedings of Third Iteration, T. Innocent (Ed.). Centre for Electronic Media Art, Melbourne, Victoria, Australia, 189–198. https://eprints.qut.edu.au/24696/
6. Andrew Brown, René Wooller, and Kate Thomas. 2007. The Morph Table: A collaborative interface for musical interaction. In Australasian Computer Music Conference (ACMC). 34–39.
7. Shan Carter and Michael Nielsen. 2017. Using Artificial Intelligence to Augment Human Intelligence. Distill (2017). http://distill.pub/2017/aia
8. Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. 2017. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. In International Conference on Machine Learning (ICML). https://arxiv.org/abs/1704.01279
9. Masatoshi Hamanaka, Keiji Hirata, and Satoshi Tojo. 2009. Melody morphing method based on GTTM. In International Computer Music Conference (ICMC). 89–92. http://hdl.handle.net/2027/spo.bbp2372.2009.020
10. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. DOI: http://dx.doi.org/10.1162/neco.1997.9.8.1735
11. IEC 60908 1987. Audio recording - Compact disc digital audio system. Standard. International Electrotechnical Commission (IEC).
12. Tristan Jehan. 2001. Perceptual synthesis engine: an audio-driven timbre generator. Master’s thesis. Massachusetts Institute of Technology (MIT). http://hdl.handle.net/1721.1/61543
13. Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR). http://arxiv.org/abs/1312.6114
14. Ian Loh and Tom White. 2017. TopoSketch: Drawing in Latent Space. In NIPS Workshop on Machine Learning for Creativity and Design. https://nips2017creativity.github.io/doc/TopoSketch.pdf
15. Yotam Mann. 2015. Interactive Music with Tone.js. In Web Audio Conference (WAC). http://wac.ircam.fr/pdf/wac15_submission_40.pdf
16. Parag Mital. 2017. Generate your own sounds with NSynth. https://magenta.tensorflow.org/nsynth-fastgen. (2017). Accessed: 2017-12-21.
17. Donald A. Norman. 2002. The Design of Everyday Things. Basic Books, Inc., New York, NY, USA.
18. Sageev Oore. 2005. Learning Advanced Skills on New Instruments. In International Conference on New Interfaces for Musical Expression (NIME). Vancouver, Canada.
19. Daniel V. Oppenheim. 1995. Demonstrating MMorph: A System for Morphing Music in Real-time. In International Computer Music Conference (ICMC). 479–480.
20. Adam Roberts, Jesse Engel, and Douglas Eck. 2017. Hierarchical Variational Autoencoders for Music. In NIPS Workshop on Machine Learning for Creativity and Design. https://nips2017creativity.github.io/doc/Hierarchical_Variational_Autoencoders_for_Music.pdf
21. Daniel Smilkov, Nikhil Thorat, and Charles Nicholson. 2017. deeplearn.js: a hardware-accelerated machine intelligence library for the web. https://deeplearnjs.org/. (2017). Accessed: 2017-12-21.
22. Laurens van der Maaten, Eric Postma, and Jaap van den Herik. 2009. Dimensionality reduction: A comparative review. Technical Report. TiCC, Tilburg University.
23. Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. 2016. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision (ECCV). https://arxiv.org/abs/1609.03552