<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Latent Representations of Music to Generate Interactive Musical Palettes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jesse Engel</string-name>
          <email>jesseengel@google.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Roberts</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Douglas Eck</string-name>
          <email>deck@google.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sageev Oore</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Google Brain</institution>
          , Mountain View,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dalhousie University &amp; Vector Institute</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Advances in machine learning have the potential to radically reshape interactions between humans and computers. Deep learning makes it possible to discover powerful representations that are capable of capturing the latent structure of high-dimensional data such as music. By creating interactive latent space “palettes” of musical sequences and timbres, we demonstrate interfaces for musical creation made possible by machine learning. We introduce an interface to intuitive, low-dimensional control spaces for high-dimensional note sequences, allowing users to explore a compositional space of melodies or drum beats in a simple 2-D grid. Furthermore, users can define 1-D trajectories in the 2-D space for autonomous, continuous morphing during improvisation. Similarly for timbre, our interface to a learned latent space of audio provides an intuitive and smooth search space for morphing between the timbres of different instruments. We remove technical and computational barriers by embedding pre-trained networks into a browser-based GPU-accelerated framework, making the systems accessible to a wide range of users while maintaining potential for creative flexibility and personalization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>
        Author Keywords
musical interface; latent space; variational autoencoder; deep
learning
©2018. Copyright for the individual papers remains with the authors.
Copying permitted for private and academic purposes.
MILC ’18, March 11, 2018, Tokyo, Japan
* Research conducted while this author was at Google Brain.
      </p>
      <p>
        INTRODUCTION
Music, when treated as data, is often represented in
high-dimensional spaces. Digital music formats such as CD-quality
pulse-code modulation (PCM), for example, record audible
vibrations as a discrete sequence of 44,100 16-bit
integers per second [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; audio can then be modelled by treating
each sample as a unique dimension and capturing
correlations between them. Alternately, musical compositions can be
communicated as a score; in one heavily constrained version
of this, for example, we could represent sequences of
monophonic 16th notes of equal intensity with approximately 7 bits
for each note, or 112 per bar. That is far fewer dimensions than
audio, but even there, exploring all possible variations of a
score by flipping one bit at a time quickly becomes intractable,
and further, would result in a large proportion of melodies
being so musically unconventional that they would easily be
perceived as being incoherent.
      </p>
      <p>
        While the high dimensionality affords an exponential
number of possibilities, only some of these possibilities are likely
for real music, which could be seen as residing on a
lower-dimensional manifold within the space. Machine learning
techniques can learn the shape of such low-dimensional
manifolds from data, and be used to gain better understanding of
and to explore large datasets [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. They can also be used to
build creative tools within the realm of “Artificial Intelligence
Augmentation” (AIA) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Learning the reduction directly
from the data allows us to avoid heuristics and hand-tuned
features, along with the biases and preconceptions about the
data that would normally accompany those.
      </p>
      <p>
        Autoencoders and variational autoencoders [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are models
designed to learn efficient, low-dimensional representations
capable of reproducing their high-dimensional inputs. The
hope is that in order to effectively utilize the “latent space”,
they will learn a mapping that is “effective”. What do we mean
by this, or rather, what might be some desirable characteristics
for such a mapping? First we might wish for smoothness: for
example, if two points are near each other in latent space, then
we would like for their corresponding points in the output
space to also be near one another. In the constrained case of
monophonic melodies mentioned above, this would mean that
we would like for the two monophonic sequences to be
perceptually similar. Second, while the original high-dimensional
space allows for very unlikely points, we would like the latent
space to correspond primarily to the likely ones: that is, if we
map from a sampled point in the latent space to the original
space, we would like the resulting point to be “feasible”, i.e.
not one of those unconventionally incoherent sequences that
we described earlier. If we can satisfy these two requirements,
then that would allow interpolation in the latent space to
correspond to a meaningful interpolation in the original space. For
example, if A and B are score representations of two
monophonic melodies, and f(A) and f(B) are their corresponding
latent representations, then as we sample N + 1 points along
the line between the latent points f(A) and f(B):
c_i = a_i f(A) + (1 - a_i) f(B)
where a_i = i/N and i runs from 0 … N, then C_i = f^(-1)(c_i)
should always be a feasible (i.e. statistically “likely”) melody,
and also C_i should be perceptually fairly similar to C_{i+1}. That
is, the smoothness of the latent space with respect to the
outputs makes it possible to define projections of the
output space onto 1-D line segments and 2-D rectangles [
        <xref ref-type="bibr" rid="ref14 ref7">14, 7</xref>
        ].
Tracing along such low-dimensional manifolds, we can thus
morph from one melody to another in an interesting way.
Low-dimensional representations thus potentially afford interesting
visualizations and natural interactions.
      </p>
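      <p>
        As a concrete sketch of the interpolation just described, the following numpy snippet computes the N + 1 points c_i along the line between two latent vectors. The 4-dimensional vectors here are hypothetical stand-ins for the encoded f(A) and f(B); in the real system each c_i would then be passed through the trained decoder f^(-1) to produce a melody.

```python
import numpy as np

def interpolate_latents(z_a, z_b, n):
    """Return n + 1 points evenly spaced on the line between z_a and z_b.

    Mirrors c_i = a_i * f(A) + (1 - a_i) * f(B) with a_i = i / n,
    so the sequence starts at f(B) and ends at f(A).
    """
    alphas = np.linspace(0.0, 1.0, n + 1)  # a_0 = 0 ... a_n = 1
    return [a * z_a + (1.0 - a) * z_b for a in alphas]

# Hypothetical 4-D latent codes standing in for f(A) and f(B).
z_a = np.array([1.0, 0.0, 2.0, -1.0])
z_b = np.array([0.0, 1.0, 0.0, 1.0])
path = interpolate_latents(z_a, z_b, 4)
```

        If the latent space satisfies the smoothness and feasibility requirements above, each decoded point on this path is a plausible melody perceptually close to its neighbors.
      </p>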
      <p>In this paper, we present two interfaces based on the principle
of musical latent spaces:</p>
      <p>
        A musical sequence explorer for 2-bar melody loops and
drum beats, using the latent space of MusicVAE [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
An instrument timbre explorer, using the audio-based latent
space of instrument samples generated by NSynth [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
The former is implemented using deeplearn.js [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] for GPU-accelerated inference in the browser, enabling dynamic
exploration with no additional installation or setup required. It can
output both audio and MIDI. The latter is implemented as a
Max For Live device with pre-synthesized samples, providing
access to a massive number of timbres from within Ableton
Live, a widely-used, professional-grade production tool.
RELATED WORK
There is a fair amount of prior work on the concept of “musical
morphing” in the space of both audio and compositions. In the
audio realm, techniques such as cross-synthesis use heuristics
or machine learning to mix hand-designed features of
instrument timbres [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For compositions, heuristics and models
based on music theory are used to morph between existing
pieces [
        <xref ref-type="bibr" rid="ref5 ref9">5, 9</xref>
        ].
      </p>
      <p>
        MMorph [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] is described as a real-time tool that allows users
to morph between up to four MIDI compositions by mixing
different elements of each piece (such as rhythm, pitch,
dynamics, timbre, harmony). Via a SmallTalk GUI, a user could
specify which elements of each MIDI file to morph with
checkboxes and drag a control around a 2-D space to determine the
relative amounts contributed by each of the four pieces in
real-time. The user could also choose from several different
morphing methods such as “interpolation” and “weighting”,
although no description of these methods is provided.
Hamanaka et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduce a simple copy-and-paste
interface for both interpolating between and extrapolating beyond
pairs of melodies by reducing them to low-dimensional
representations based on music theory.
      </p>
      <p>
        The Hyperviolin [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a physical, violin-like instrument that
makes use of cross-synthesis to produce timbres mixing
between various instruments such as a flute or choral vocals. The
developers demonstrated multiple methods of controlling the
relative weights of the high-level timbral features via
“sensor shoes”. The performer could control some constrained
aspects of the timbre within a preset space in real-time with
movements of her foot and could move between different
presets throughout a piece by making larger physical movements
along a path.
      </p>
      <p>
        The Morph Table [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] uses a 3-D interface with physical cubes
allowing multiple users to control the musical output of the
system by morphing in both audio and compositional spaces.
Movements of each cube in one dimension are associated
with compositional morphing using techniques such as key
modulation, linear interpolation of note features (pitch, onset,
duration, etc.), or “cross-fading” between subsections. In a
second dimension, standard audio effects are adjusted.
Movements in a third dimension, accessed by rotating the
cubes to expose different faces, change the morphing
endpoints between six different presets.
      </p>
      <p>
        Existing examples of AIA interfaces using latent spaces
include examples for generating faces [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], fonts [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and generic
images such as shoes and landscapes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        METHODS
Both of the interfaces introduced in this paper employ
autoencoders, unsupervised machine learning models each composed
of three parts: an encoder, a latent vector, and a decoder
(Figure 1). The latent vector, z, is an intermediate representation
of the data examples, x, but is either of lower dimension or is
regularized to produce an information bottleneck. In the case
of variational autoencoders (VAE), z is regularized to
encourage the model to learn low-entropy latent representations of
x close to a standard multivariate Gaussian distribution. The
encoder is a neural network that produces z from the input x
and the decoder is a separate neural network that attempts to
reconstruct x from z. Gradient-based optimization is then used
to reduce the reconstruction loss (the difference between the
input and the decoded output) and the KL divergence from
the Gaussian prior (if applicable) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
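      <p>
        The two loss terms can be made concrete with a small numpy sketch. The squared-error reconstruction term and the closed-form KL divergence of a diagonal Gaussian from the standard normal are standard choices for illustration; the actual models may use different likelihood terms.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL divergence of a diagonal Gaussian N(mu, exp(log_var)) from N(0, I)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction error plus the KL regularizer on the latent code."""
    recon = np.sum((x - x_recon) ** 2)  # squared error stands in for the likelihood term
    return recon + kl_to_standard_normal(mu, log_var)

# A latent code that already matches the standard-normal prior pays no KL penalty.
zero_kl = kl_to_standard_normal(np.zeros(8), np.zeros(8))
```

        Gradient-based training drives both terms down together, which is what forces the encoder to spend the limited latent capacity on the features that matter most.
      </p>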
      <p>The purpose of the bottleneck is that it forces the model to
distill the information content of the data to lower-entropy
representations. In the process, it discovers and encodes the most
important features that differentiate data examples. The model
becomes specialized to produce examples like those from the
training distribution, making it difficult for the decoder to
produce “unrealistic” outputs that are improbable under the
distribution p(x). However, when trained properly it is general
enough to be able to reconstruct and generate examples that
are from the same distribution as p(x) but do not appear in the
train set. The resulting latent space is therefore optimized for
various forms of exploration including interpolation within the
latent space, which can “morph” between values in the output
space, producing realistic intermediate outputs that combine
features of the endpoints to varying degrees.</p>
      <p>
        MusicVAE
MusicVAE is a VAE for musical sequences, such as drum beats
and melodies. It uses LSTM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] recurrent neural networks
as its encoder and decoder. For our purposes, we focus on
models learned from either 2-bar drum beats or 2-bar melody
loops. Training sets were obtained by scraping the web for
MIDI files and extracting all unique 2-bar sequences of the
two types, resulting in 28 million melodies and 3.8 million
drum beats.
      </p>
      <p>To produce good reconstructions (and interpolations), we train
a VAE with a trade-off parameter that assigns a lower weight
to the KL divergence, thus allowing the system to pass enough
information through the bottleneck to be able to reproduce
nearly any realistic input sequence in our dataset. However,
models trained in this way on noisy datasets such as ours often
do not produce realistic examples when sampling random
points from the Gaussian prior. To produce better random
samples, we train VAEs with the trade-off parameter set to a
value that encourages a better KL divergence at the expense
of reconstruction quality.</p>
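      <p>
        The trade-off parameter can be sketched as a simple weight on the KL term of the objective; the numeric loss values below are purely hypothetical, chosen only to show how the weighting shifts the balance.

```python
def total_loss(recon, kl, beta):
    """Weighted VAE objective: beta below 1 down-weights the KL term,
    letting more information through the bottleneck (better
    reconstructions, worse random samples from the prior)."""
    return recon + beta * kl

# Hypothetical per-example loss values, for illustration only.
recon_term, kl_term = 4.0, 10.0
reconstruction_focused = total_loss(recon_term, kl_term, beta=0.2)
sample_focused = total_loss(recon_term, kl_term, beta=1.0)
```

        In this framing, the reconstruction-oriented model used for the palette corners and the sampling-oriented model used for random corners differ only in this single weight.
      </p>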
      <p>
        NSynth
With NSynth, we are able to learn latent spaces not just of
musical sequences, but of audio waveforms themselves. Like
the MusicVAE, NSynth is an autoencoder architecture (though
not variational) that can learn to encode and decode sequences
from a compressed latent space. Since the model must learn
to most efficiently use the lower-entropy representation to
represent waveforms, it learns salient acoustic features among
the training data. However, the model architecture differs
significantly from MusicVAE as it must capture correlations
over thousands of quantized waveform timesteps [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        We train the NSynth model on the NSynth dataset, a collection
of ~300,000 individual 4-second notes recorded from ~1,000
separate instruments and synthesizers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. By choosing to train
on individual pitched notes, the model can then be used as a
“neural synthesizer” to play back notes when receiving an input
MIDI signal.
      </p>
      <p>Since the model learns an expressive code for raw audio, it
can be used to interpolate in this space and discover new
instrument timbres that exist between pre-existing instruments.
However, while the latent space is of much lower dimension
than the original audio, there is no prior to sample from as was
the case for MusicVAE. Therefore, there is no trivial way to
randomly sample novel musical notes from the distribution,
and we are limited to exploring subspaces anchored on known
notes.</p>
      <p>
        INTERFACES
MusicVAE Sequencer
The MusicVAE Sequencer (Figure 2) is an interface to the
latent spaces of 2-bar drum loops and monophonic melodies, as
described above. Users can toggle between these two spaces
and define the four corner sequences of a 2-D, 11x11 grid by
randomly sampling from the latent space (using the low-KL
model), selecting from a predefined list, or inputting them
manually into the sequencer at the resolution of 16th notes.
Once the corner sequences are set, they are passed through
the encoder of the VAE to determine their latent vectors. The
latent vectors are then mixed using bi-linear interpolation in
the 11x11 space, and the resulting values are decoded into
sequences (from the high-KL model). With our implementation
of the model architecture in deeplearn.js using weights learned
via the original TensorFlow [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] model, the encoding,
interpolation, and decoding can all be executed with sub-second
latency on a typical consumer laptop.
      </p>
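      <p>
        The bi-linear mixing of the four corner latents can be sketched as follows; the 2-D vectors here are hypothetical stand-ins for the encoded corner sequences, and in the real interface each grid cell would be decoded back into a drum beat or melody.

```python
import numpy as np

def bilinear_grid(z_tl, z_tr, z_bl, z_br, n=11):
    """Mix four corner latent vectors over an n x n grid by bi-linear
    interpolation; each cell would then be decoded into a sequence."""
    d = z_tl.shape[0]
    grid = np.empty((n, n, d))
    for r in range(n):
        for c in range(n):
            u, v = c / (n - 1), r / (n - 1)  # horizontal, vertical fractions
            top = (1 - u) * z_tl + u * z_tr
            bottom = (1 - u) * z_bl + u * z_br
            grid[r, c] = (1 - v) * top + v * bottom
    return grid

# Hypothetical 2-D latents standing in for the four encoded corner sequences.
corners = [np.array([0.0, 0.0]), np.array([1.0, 0.0]),
           np.array([0.0, 1.0]), np.array([1.0, 1.0])]
grid = bilinear_grid(*corners)
```

        Because the grid is computed entirely in the latent space before decoding, every interior square inherits the smoothness and feasibility properties of the learned mapping.
      </p>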
      <p>
        Now that the palette is filled, the user can drag the white “puck”
around to hear the drum beat or melody loop in each square
of the grid. When the puck moves to a new square, the
sequencer immediately updates but the play-head does not reset
to the beginning, allowing the music to morph immediately
but smoothly. The sound is played via the browser using preset
audio samples with Tone.js [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and can optionally be output
as MIDI events via Web MIDI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to be further manipulated
and synthesized externally.
      </p>
      <p>The corner sequences can be changed at any time by the same
methods mentioned above or by dragging one of the interior
squares to a corner of the grid, which will set it as the new
corner and cause the interior points to be updated.
If using the sequencer as a composition tool, the user can
record the MIDI or audio externally, or she can right click on
any square to download a MIDI file containing that particular
sequence to use as a part of an arrangement.</p>
      <p>In the case of improvisation or live performance, the user
may also use the “draw” tool to define a 1-D curve within the
palette, which the puck will then move through at a rate the
user controls. As the puck moves into each new square, the
playback sequence is updated, just as if the user had moved
the puck there by hand. In this manner, a performer can
set up paths in both the drum and melody palettes that will
continuously evolve at potentially different rates, introducing
musically interesting phasing effects.</p>
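      <p>
        One way to realize such a 1-D path is sketched below, under the assumption that the drawn curve is stored as a polyline in normalized palette coordinates; the function name cell_at and the arc-length parameterization are ours, not necessarily the interface's.

```python
import numpy as np

def cell_at(t, path, n=11):
    """Map a parameter t in [0, 1] along a drawn polyline (points in the
    unit square, with no repeated points) to the (row, col) of the
    palette square the puck occupies."""
    path = np.asarray(path, dtype=float)
    seg_lens = np.linalg.norm(np.diff(path, axis=0), axis=1)
    target = t * seg_lens.sum()          # arc length travelled so far
    point = path[-1]
    for p0, p1, length in zip(path[:-1], path[1:], seg_lens):
        if target > length:
            target -= length
            continue
        point = p0 + (target / length) * (p1 - p0)
        break
    x, y = point
    return min(int(y * n), n - 1), min(int(x * n), n - 1)

# Hypothetical L-shaped path: down the left edge, then across the bottom.
path = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

        Advancing t at different rates for the drum and melody palettes is what produces the phasing effects described above.
      </p>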
      <p>
        NSynth Instrument
The NSynth model is extremely computationally expensive
(~30 minutes of GPU synthesis to generate four seconds of
audio) and therefore presented a distinct challenge in creating
an interactive experience for interpolating within the latent
space in real-time on a laptop. Rather than generating sounds
on demand, we curated a set of original sounds ahead of time
and synthesized all of their interpolated latent representations.
To produce even finer-grained resolution during playback,
transitions are smoothed out by additionally mixing the audio in
real-time from the nearest neighbor sounds on the grid. This is
a straightforward case of trading off computation for memory.
We created a playable instrument integrated into Ableton Live
as a Max For Live device (Figure 3). We positioned real
instrument timbres at the corners of a square grid, allowing
the user to mix between all four using the latent space as a
palette. We created further “multigrid” modes, tiling many
four-instrument grids side by side. This results in a 7x7 grid
of 4x4 grids, enabling a user to explore up to 64 different
instruments by dragging across a single x-y pad. Given an x-y
point, the device finds the nearest four interpolated samples
and plays them back with volumes weighted by their x-y
proximity to the point. Users can choose from 5 pre-generated
grids and 3 multigrids or produce their own using
the TensorFlow code released with the NSynth model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Incoming MIDI notes are played back in the appropriate
timbre based on the position on the palette. Additional controls
include the ability to adjust where in each sample the playback
should start, the playback speed, and settings for an Attack
Decay Sustain Release (ADSR) envelope. Furthermore, the
NSynth Instrument can be combined with other Live devices
to add additional effects such as an arpeggiator.
      </p>
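      <p>
        The nearest-four playback might be weighted as in the following sketch. Inverse-distance weighting is our assumption here; the device could equally use bilinear weights within each cell.

```python
import numpy as np

def playback_gains(point, sample_positions, k=4, eps=1e-6):
    """Pick the k nearest pre-synthesized grid samples to an x-y point
    and weight their volumes by proximity (closer samples play louder)."""
    point = np.asarray(point, dtype=float)
    positions = np.asarray(sample_positions, dtype=float)
    dists = np.linalg.norm(positions - point, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()  # gains sum to 1
    return list(zip(nearest.tolist(), weights.tolist()))

# Hypothetical 2 x 2 cell of pre-synthesized interpolated samples.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

        Mixing a few pre-rendered neighbors this way is what lets the instrument feel continuous despite the latent grid being synthesized offline.
      </p>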
      <p>
        Finally, users are able to map all of these settings to an external
MIDI controller for ease of use in a studio or live setting.
EXPLORING THE INTERFACE
Interfaces provide mappings from the user’s actions to the
output space. In the case of the MusicVAE sequencer, the input
space consists of four user-defined basis points (the corners)
combined with a 2-D control surface (the palette), and the
output space consists of note or drum sequences. Generally,
choosing the mapping is a crucial element of designing a
user interface, with numerous trade-offs at play, and where
the goals often include carefully and deliberately shaping the
user’s cognitive model of the system [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In our case, the
mapping is learned with a VAE, and therefore this raises the
question of how the user will understand and interact with
it: from the user’s perspective, how does this mapping work?
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] describes some of the steps that a user may take when
learning continuous mappings for complex controls over
high-dimensional output spaces; we follow some of those basic
approaches to explore the MusicVAE Sequencer interface. In
particular, we show how to start isolating and exaggerating
aspects of the mapping.
      </p>
      <p>One example of exaggeration is that we set three of the basis
points (bottom-left, top-left, and top-right) to all be empty, and
set the bottom-right to be entirely full (i.e. every percussion
instrument playing at every 16th note). In that case, our
interpolation lets us check what the system does as it goes from
completely sparse (silence) to completely dense, as shown in
Figure 4.</p>
      <p>As we interpolate from empty to full, we notice that the system
tends to start by populating the sequencer with percussion
hits on the actual beats, then on the 8th notes, and then on
the 16th notes, i.e. it progresses from more likely to less
likely beats. It also progresses from sparser beat sequences
to denser ones. This satisfies the requirements outlined in the
introduction as follows: (1) smoothness is preserved, in that
each sample along the interpolated latent path corresponds to
beat sequences that are similar to (i.e. a little more or less
dense than) those corresponding to its adjacent samples, and
(2) by first adding beats on quarter notes, then 8th notes, etc.,
the system appears to maintain feasible (i.e. more likely)
outputs.1
      <p>Figure 5 shows another simple experiment in which we try
isolating the location (phase) of the beat in the bar. We see that
rather than interpolating the phase (which would gradually
shift the offset), there is a tendency to superpose the two
rhythmic figures. This, too, is a reasonable rhythmic approach. We
note that doing so in melody space would not necessarily be as
natural of a choice (and in fact the system would not have that
option since the melodies are constrained to be monophonic).
While there is usually a lot more sophistication occurring in
the interpolations aside from superposition, these results for
simple cases have a natural interpretation for the user,
providing landmarks on which to ground their understanding and
cognitive models of the system.
1For example, one could imagine a sequence that goes from sparse
to dense, but in which the new percussion hits are added by simply
choosing random cells in the grid from a uniform distribution over
the grid. This could still be a smooth mapping, but it would have the
problem that the intermediate beat sequences, during interpolation,
would generally be very unlikely.
      <p>In Figure 6, we explore interpolation in melody space by
providing several different scales as the four basis points. As
we move from top-left to top-right, we notice that at each step
the output sequence is similar to the previous step, but that
the combination of these smooth steps ultimately results in a
very large change. Interestingly, at around a third of the way
across, the output sequence includes some descending notes,
even though both scales ascend. This shows that the system
will sometimes move away from the basis points during the
interpolation.</p>
      <p>Finally, we note that some parameters, such as the number of
grid points, were chosen in order to make the system
functional, with the focus on learning an effective latent space
mapping, but in future, these are design elements that would
be worthwhile exploring more systematically and with user
testing.</p>
      <p>AVAILABILITY
Supplementary resources including open-source code are
available for both the NSynth and MusicVAE interfaces in
corresponding sub-directories of Magenta’s demo GitHub repository2.
2https://github.com/tensorflow/magenta-demos/
CONCLUSION
This work demonstrates the use of machine learning as the
basis for creative musical tools. Machine learning is often seen
as a method for outsourcing mundane discriminative tasks,
resulting in the desire for rigid systems that perform their duties
with high accuracy. While it is likely that such systems could
be used to produce satisfying music for listeners, our work
indicates directions for how we might use machine learning,
specifically latent-space representations, to create UI
mappings that we hope will eventually provide music creators with
new and interesting creativity-supporting tools.</p>
      <p>ACKNOWLEDGMENTS
We thank Torin Blankensmith and Kyle Phillips from Google
Creative Lab for implementing the MusicVAE Sequencer UI.
We thank Nikhil Thorat for assistance with the deeplearn.js
implementation. We thank the members of the Magenta team
for helpful discussions and observations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <year>2015</year>
          .
          <article-title>Web MIDI API</article-title>
          . http://webaudio.github.io/web-midi-api/. (
          <year>2015</year>
          ). Accessed:
          <fpage>2017</fpage>
          -12-21.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Martín</given-names>
            <surname>Abadi</surname>
          </string-name>
          and others.
          <source>2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. White Paper</source>
          . Google. http: //download.tensorflow.org/paper/whitepaper2015.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Mason</given-names>
            <surname>Bretan</surname>
          </string-name>
          , Sageev Oore, Jesse Engel, Douglas Eck, and
          <string-name>
            <given-names>Larry</given-names>
            <surname>Heck</surname>
          </string-name>
          . 2017a.
          <article-title>Deep Music: Towards Musical Dialogue</article-title>
          .
          <source>In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17).</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Mason</given-names>
            <surname>Bretan</surname>
          </string-name>
          , Gil Weinberg, and
          <string-name>
            <given-names>Larry</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <year>2017b</year>
          .
          <article-title>A Unit Selection Methodology for Music Generation Using Deep Neural Networks</article-title>
          .
          <source>In International Conference on Computational Creativity (ICCC</source>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Brown</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rene</given-names>
            <surname>Wooller</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Investigating Morphing Algorithms for Generative Music</article-title>
          . In Proceedings of Third Iteration,
          <string-name>
            <given-names>T</given-names>
            <surname>Innocent</surname>
          </string-name>
          (Ed.).
          <article-title>Centre for Electronic Media Art</article-title>
          , Australia, Victoria, Melbourne,
          <fpage>189</fpage>
          -
          <lpage>198</lpage>
          . https://eprints.qut.edu.au/24696/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Brown</surname>
          </string-name>
          , René Wooller, and Kate Thomas.
          <year>2007</year>
          .
          <article-title>The Morph Table: A collaborative interface for musical interaction</article-title>
          .
          <source>In Australasian Computer Music Conference (ACMC)</source>
          .
          <volume>34</volume>
          -
          <fpage>39</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Shan</given-names>
            <surname>Carter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Nielsen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Using Artificial Intelligence to Augment Human Intelligence</article-title>
          .
          <source>Distill</source>
          (
          <year>2017</year>
          ). http://distill.pub/2017/aia
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Engel</surname>
          </string-name>
          , Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Norouzi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders</article-title>
          .
          <source>In International Conference on Machine Learning (ICML)</source>
          . https://arxiv.org/abs/1704.01279
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Masatoshi</given-names>
            <surname>Hamanaka</surname>
          </string-name>
          , Keiji Hirata, and
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Tojo</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Melody morphing method based on GTTM</article-title>
          .
          <source>In International Computer Music Conference (ICMC)</source>
          .
          <fpage>89</fpage>
          -
          <lpage>92</lpage>
          . http://hdl.handle.net/2027/spo.bbp2372.2009.020
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . DOI: http://dx.doi.org/10.1162/neco.1997.9.8.1735
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. IEC 60908
          <year>1987</year>
          .
          <article-title>Audio recording - Compact disc digital audio system</article-title>
          .
          <source>Standard. International Electrotechnical Commission (IEC).</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Tristan</given-names>
            <surname>Jehan</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Perceptual synthesis engine: an audio-driven timbre generator</article-title>
          .
          <source>Master's thesis</source>
          . Massachusetts Institute of Technology (MIT). http://hdl.handle.net/1721.1/61543
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Max</given-names>
            <surname>Welling</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Auto-Encoding Variational Bayes</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR)</source>
          . http://arxiv.org/abs/1312.6114
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Ian</given-names>
            <surname>Loh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tom</given-names>
            <surname>White</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>TopoSketch: Drawing in Latent Space</article-title>
          .
          <source>In NIPS Workshop on Machine Learning for Creativity and Design</source>
          . https://nips2017creativity.github.io/doc/TopoSketch.pdf
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Yotam</given-names>
            <surname>Mann</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Interactive Music with Tone.js</article-title>
          .
          <source>In Web Audio Conference (WAC)</source>
          . http://wac.ircam.fr/pdf/wac15_submission_40.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Parag</given-names>
            <surname>Mital</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Generate your own sounds with NSynth</article-title>
          . https://magenta.tensorflow.org/nsynth-fastgen. (
          <year>2017</year>
          ). Accessed: 2017-12-21.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Donald A.</given-names>
            <surname>Norman</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>The Design of Everyday Things</article-title>
          . Basic Books, Inc., New York, NY, USA.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Sageev</given-names>
            <surname>Oore</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Learning Advanced Skills on New Instruments</article-title>
          .
          <source>In International Conference on New Interfaces for Musical Expression (NIME)</source>
          . Vancouver, Canada.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Daniel V.</given-names>
            <surname>Oppenheim</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Demonstrating MMorph: A System for Morphing Music in Real-time</article-title>
          .
          <source>In International Computer Music Conference (ICMC)</source>
          .
          <fpage>479</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Adam</given-names>
            <surname>Roberts</surname>
          </string-name>
          , Jesse Engel, and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hierarchical Variational Autoencoders for Music</article-title>
          .
          <source>In NIPS Workshop on Machine Learning for Creativity and Design</source>
          . https://nips2017creativity.github.io/doc/Hierarchical_Variational_Autoencoders_for_Music.pdf
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. Daniel Smilkov, Nikhil Thorat, and
          <string-name>
            <given-names>Charles</given-names>
            <surname>Nicholson</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>deeplearn.js: a hardware-accelerated machine intelligence library for the web</article-title>
          . https://deeplearnjs.org/. (
          <year>2017</year>
          ). Accessed: 2017-12-21.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. Laurens van der Maaten, Eric Postma, and Jaap van den Herik.
          <year>2009</year>
          .
          <article-title>Dimensionality reduction: A comparative review</article-title>
          .
          <source>Technical Report</source>
          . TiCC, Tilburg University.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Jun-Yan</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Philipp Krähenbühl, Eli Shechtman, and
          <string-name>
            <given-names>Alexei A.</given-names>
            <surname>Efros</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Generative visual manipulation on the natural image manifold</article-title>
          .
          <source>In European Conference on Computer Vision (ECCV)</source>
          . https://arxiv.org/abs/1609.03552
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>