                           Towards a Machine Learning Based Control
                      of Musical Synthesizers in Real-Time Live Performance
                                            Nathan Sommer and Anca Ralescu
                                                      EECS Department
                                               University of Cincinnati, ML0030
                                               Cincinnati, OH 45221-0030, USA
                                           sommernw@mail.uc.edu, anca.ralescu@uc.edu


                           Abstract
Musicians who play synthesizers often adjust synthesis parameters during live performance to achieve a more expressive sound. Training a computer to make automatic parameter adjustments based on examples provided by the performer frees the performer from this responsibility while maintaining an expressive sound in line with the performer's desired aesthetic. This paper is an overview of ongoing research to explore the effectiveness of using Long Short-Term Memory (LSTM) recurrent neural networks to accomplish this task.


                       Introduction
Electronic sound synthesizers have been used as musical instruments for more than a century, and musicians and researchers continue to explore new ways to synthesize interesting and expressive sounds. Approaches to allow humans to control synthesizers include woodwind style controllers, guitar and other stringed instrument style controllers, and controllers that map gestures to musical events.
However, the most popular synthesizer controller continues to be the piano-style keyboard.

The keyboard remains an attractive controller because it is familiar to many musicians, and because it is a natural way to tell a synthesizer to start or stop playing a sound. When a key on a keyboard is depressed, a message is sent to start playing a sound at a certain frequency. When the key is released, a message is sent to tell the synthesizer to stop.

What's missing from this is a way to control the quality of the sound once a key has been depressed. Wind instruments allow the musician to alter the quality of the sound through breath and mouth control, and bowed string instruments allow for different sounds through different bowing techniques. To allow for similar expressive sound adjustments, most synthesizers have a number of parameters that can be adjusted via knobs, sliders, wheels, pedals, and other methods. This allows for a great deal of sound control, but the number of parameters that can be controlled simultaneously is limited by the number of hands and feet the performer has, and often the performer would like to use both hands simultaneously to play the keyboard.

One way to allow for a more expressive sound during performance without requiring the human performer to directly control synthesis parameters is to use a computer to control the parameters. Many modern synthesizers are hardware or software modules that are not directly connected to a keyboard. Multiple protocols exist to control these synthesizers, such as MIDI and OSC. Keyboards, other types of controllers, and computers can send messages to these synthesizers telling them to start or stop playing a sound, or to change a synthesis parameter value. With such a setup, a computer program can be used to monitor messages from a keyboard controller and alter the synthesis parameters in real time based on what the human is playing. This paper proposes a synthesizer control system, the Middleman, which implements such a setup, and is illustrated in Figure 1.

Figure 1: Basic functionality of the proposed system. Note events are received by the system and allow it to continually update the current musical context, which is continually fed through the LSTM network. Note events are passed through to the synthesizer along with generated synthesizer parameter change events.
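The sketch below illustrates the kind of event loop implied by Figure 1: note messages are monitored, synthesis parameters are updated, and the original events are passed through to the synthesizer. It is only a minimal sketch, not the authors' implementation; it assumes the python-mido MIDI library, the port names are placeholders, and predict_parameter_levels() is a hypothetical stand-in for the trained network.

    # Minimal sketch of a Middleman-style event loop (illustrative only).
    # Assumes the python-mido library; port names and the predictor are placeholders.
    import mido

    def predict_parameter_levels(context):
        # Hypothetical stand-in for the trained network: map the musical
        # context to MIDI control change values (0-127).
        return {1: 64}

    context = []  # running list of note events, i.e. the current musical context

    with mido.open_input('Keyboard') as keys, mido.open_output('Synth') as synth:
        for msg in keys:
            if msg.type in ('note_on', 'note_off'):
                context.append(msg)
                # Set parameters before forwarding the note, so the synthesizer
                # is already configured when the note begins to sound.
                for cc, value in predict_parameter_levels(context).items():
                    synth.send(mido.Message('control_change', control=cc, value=value))
            synth.send(msg)  # pass the original event through to the synthesizer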
Because different musicians have different desired aesthetics, there is no one correct way to shape synthesis parameters over time during a performance. Ideally a performer would be able to teach a machine to learn to control the parameters in a way that is consistent with the performer's desired aesthetic.
This paper explores the extent to which machine learning techniques, specifically a class of neural networks, can be used to achieve such control in a manner comparable to that of a human musician.

Expressive Music Performance

Expressive music performance has been of particular interest to researchers in the last decade, and much work has been done to attempt to model expressive human performance with machines (Kirke and Miranda, 2009). These models can be used to generate expressive musical performances by machine alone, and also for collaborative performances between human and machine.

When human musicians perform a piece of music from a written score, they inject their own expression and aesthetic into the music by varying the following musical parameters, or performance actions (Kirke and Miranda, 2009):

• tempo, the speed of the music
• dynamics, how loud the notes are played
• articulation, the transitions between notes
• intonation, pitch accuracy
• timbre, the quality of the sound

A musician may only be able to vary a subset of these actions depending on the instrument played. For example, a saxophone player can vary the timbre and intonation of a note as it is being played by altering the tightness of the mouth on the mouthpiece of the instrument, but a piano player cannot achieve this at all.

Traditionally, music composers and arrangers provide information about how pieces of music are intended to be performed through scores. Scores contain notation that tells musicians how loud to play, when to speed up and slow down, what articulation to use for what notes, etc. The musician ultimately decides how to interpret this information, and adds expressive subtlety to a piece of music during performance that cannot be conveyed in a score alone.

Expressive Computer Music Performance

Similarly to a human performer, a computer can perform a piece of music on an electronic instrument as it is written in a score. However, it is difficult for a computer to perform a piece of music with the same expressive subtlety as a human performer. Computers are good at following rules, but the rules for musical expression are difficult to define. Different styles of music have different rules of expression, and often what makes a particular performance of a piece of music interesting is how the musician plays with the listener's expectations of such rules.

One way researchers have tried to get computers to perform music in an expressive way is by learning expression rules from human performances (Widmer, 2001). Performance examples are used which include a musical score along with a recorded human performance, and algorithms are employed to explicitly or implicitly extract performance rules that a computer can use to perform unseen scores.

One of the easiest instruments with which to accomplish this task is the piano. Unlike many wind and string instruments, the piano does not allow for subtle control of dynamics, timbre, and intonation once a note has been struck. Because the piano can only play discrete notes there is no control over intonation, and the timbre can only be controlled in the same manner as the dynamics, in how hard the key is struck.

Due to this, a piano performance can easily be described with key and pedal events rather than recorded audio, and mechanical pianos can be controlled by a computer. There is an annual piano performance rendering contest called Rencon (http://renconmusic.org/) which evaluates performance rendering systems' abilities at performing unseen musical scores on the piano and continues to push progress in this area (Widmer, Flossmann, and Grachten, 2009). An interesting aspect of this competition is that while the contestants are computer systems, the judges are human and therefore the evaluations are highly subjective.

Human Computer Collaborative Performance

Other research explores ways in which humans and computers can collaboratively contribute to expressive performances. The OMax system (Assayag et al., 2006) allows for improvisational collaboration in real time by listening to the human musician, learning features of the musician's style, and playing along interactively. Music Plus One (Raphael, 2010) is an automatic accompaniment system which plays a scored piece along with a soloist, following the soloist's tempo. In addition to the piano performance rendering contest, Rencon also has a "semi-automatic and interactive" category, which in 2013 was won by VirtualPhilharmony, a system that allows a human to conduct a virtual orchestra.

All of these examples are exciting works that showcase the extent to which humans and computers can work together to make music, but none of the collaborative systems developed so far address the problem put forth in this paper: allowing a human performer to control the pitch, tempo, dynamics, and articulation of a performance, while a computer controls timbre and intonation by varying sound synthesis parameters.

We hypothesize that a machine learning approach to this problem can be successful. Allowing musicians to train the system with example performances created by the musicians themselves will result in performances that are unique and adhere to the performers' visions. This problem also presents unique challenges from a machine learning perspective, which will be discussed in later sections.

Creating Performance Examples

Often when creating recordings in a studio, synthesizer parts are initially recorded as events rather than as audio. This allows musicians to separate the recording of note events from the recording of synthesizer parameter change events. After a keyboard performance has been recorded, parameter changes can be recorded to achieve the desired sound over the course of the recording. In this way, musicians can create interesting and expressive synthesizer recordings that could not be performed live by a single musician.
These studio event recordings can be used as training examples for our system. If the system can learn to reproduce the desired parameter changes while a human is performing, temporally changing sounds that were previously only attainable in a studio can be brought to life during live performances. This method of creating training examples is natural because musicians are already accustomed to recording this way, and allows them to use the synthesizers they are already familiar with.
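To make the idea concrete, one possible representation of such a training example is sketched below: note events and parameter change events kept as separate streams that share a common time axis, mirroring how they are recorded separately in the studio. The field names, units, and types are assumptions for illustration, not the system's actual format.

    # One possible representation of a training example (illustrative only):
    # separately recorded note events and parameter change events that share
    # a common time axis.
    from dataclasses import dataclass, field

    @dataclass
    class NoteEvent:
        time: float      # seconds from the start of the recording
        pitch: int       # MIDI note number, 0-127
        velocity: int    # 0 for note off, 1-127 for note on

    @dataclass
    class ParameterEvent:
        time: float
        parameter: str   # e.g. "filter_cutoff"
        value: float     # normalized 0.0-1.0

    @dataclass
    class TrainingExample:
        notes: list = field(default_factory=list)        # NoteEvent objects
        parameters: list = field(default_factory=list)   # ParameterEvent objects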

         Learning with Time Series Data
Many tasks which have been tackled with machine learning approaches involve time series data. Some of these tasks, such as speech recognition, involve finding patterns in time series data. Other tasks, such as numerous types of forecasting, involve predicting what will happen in the future based on what has happened in the past.

For this particular problem we are concerned with predicting the immediate future values of synthesizer parameters. These values must be predicted based on two things:

• past parameter values
• the current musical context of the piece being played.

The important aspects of the musical context that affect the parameter levels are defined implicitly by a musician through training examples, and so the learning system employed must be able to discover those aspects. The current musical context at any time step during a musical performance is dependent on events that have happened at previous time steps, and so the system must have an awareness of the past. Which past events are important and how long they remain important will differ for each set of training examples, and so the system must be flexible in that regard. The following sections discuss some techniques that have been used to achieve this goal in other problem domains.

Recurrent Neural Networks

Artificial neural networks have long been useful tools in machine learning due to their ability to approximate non-linear functions. While the standard artificial neural network can be useful for some time series data learning tasks, there are limitations to the model when applied to time series data.

After a standard neural network is trained, data is input via input nodes. In each subsequent layer, node activations are calculated by applying activation functions to weighted sums of the previous layer's activation values. No information about previous activation values is stored, and so the network is in the same state before each forward pass. Therefore, in order for such a network to accept time series data as input it must receive a window of data, or all the data after time step t − n up to time step t, where t is the latest time step under consideration and n is the size of the window.

This sliding window approach has limitations because events relevant to time step t could have occurred at or before time step t − n, yet they will not be taken into consideration by the network because they are outside of the window. An alternative to the window approach is to use a Recurrent Neural Network (RNN) (Jordan, 1986; Elman, 1990). RNNs contain nodes which retain activation values from the previous time step, and contain recurrent connections from those nodes to other nodes in the network. In this manner, data can be fed into the network one time step per forward pass, and the network will learn to take into account information from past time steps when calculating output for the current time step.

Simple RNNs as described above are generally trained using Backpropagation Through Time (BPTT). Using this method, errors at the current time step flow backwards through previous time steps in order to calculate changes in the network's weights. However, these errors either vanish or blow up as they travel backwards in time. As a result, simple RNNs cannot learn well when relevant events happen more than 5-10 time steps in the past (Gers, Schmidhuber, and Cummins, 2000).

LSTM Networks

Long Short-Term Memory (LSTM) RNNs overcome this limitation (Gers, Schmidhuber, and Cummins, 2000). LSTM networks contain memory cells which retain values between forward passes through the network. The networks also contain three types of specialized activation units called gates. A group of memory cells and their associated gates are organized into blocks. Input gates control write access to the memory cells by scaling values that are to be added to the memory cells; output gates control read access from the memory cells by scaling the values that are output by the cells; forget gates allow the memory cell to periodically reset by scaling the cell's current value. The gates have weighted connections from the input as well as recurrent connections from the other gates and memory cells. During training the gates learn to open and close so that the memory cells can store and accumulate values, keep them for arbitrary periods of time, and use the cells' values to affect the output as needed.

Figure 2: LSTM block with a single memory cell, taken from Gers, Schraudolph, and Schmidhuber (2003). The gates and cell input activations are calculated by passing the weighted sum of incoming connections through an activation function, as with a standard artificial neural network node. Input to the cell is scaled by the input gate's activation, output from the cell is scaled by the output gate's activation, and the cell's state is scaled by the forget gate's activation.
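The following sketch summarizes one forward step through such a block of memory cells, in the spirit of the formulation in Gers, Schmidhuber, and Cummins (2000). It is illustrative rather than the implementation used here: the weight layout and shapes are assumptions, and variations such as peephole connections are omitted.

    # Illustrative forward step for a layer of LSTM memory cells (not the
    # authors' implementation; weight layout and shapes are assumptions).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, y_prev, s_prev, W):
        """x: input vector; y_prev: previous cell outputs; s_prev: previous cell
        states; W: dict of weight matrices for the three gates and the cell input."""
        z = np.concatenate([x, y_prev])      # input plus recurrent connections
        i = sigmoid(W['in'] @ z)             # input gate: scales what is written
        f = sigmoid(W['forget'] @ z)         # forget gate: scales the old state
        o = sigmoid(W['out'] @ z)            # output gate: scales what is read
        g = np.tanh(W['cell'] @ z)           # squashed cell input
        s = f * s_prev + i * g               # memory cell state update
        y = o * np.tanh(s)                   # cell output passed on to the output layer
        return y, s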
LSTM networks have been used successfully in both recognizing patterns from sequences and in generating sequences. Most notably they have proven highly effective in handwriting recognition and generation (Graves, 2013). There have been musical applications of LSTM: LSTM networks have been used to generate musical chord progressions (Eck and Schmidhuber, 2002), and for learning musical structure from scores (Eck and Lapalme, 2008), but these applications operate on the level of notes and chords and do not operate in real time. Our project explores how effectively LSTM networks can be trained to implicitly extract relevant high level musical structures from low level time series input, and use that information to control sound synthesis parameters.
Specific Challenges

In order for this system to be successful, several challenges must be overcome. The learning system is to be given the musical context at each time step, and so the most effective manner of encoding the current musical context must be determined. The system must run in real time, thus care must be taken to ensure it can continually predict parameter levels quickly enough. Generalization is an issue with any machine learning task, and here one must be careful not to overfit to the provided training examples so that the system can generalize well. Finally, suitable metrics must be devised to evaluate the success of the system.

Capturing the Musical Context

When this system is used, it will be given a regularly updated state of the current musical context. This information must be determined from the stream of note on and note off events received from the keyboard and should provide enough context to the learning system so that it can effectively predict the synthesizer parameter levels.

One very simple way to capture the context is to have a single input value which represents whether or not a note is currently being played. When a note on event is sent from a controller, it has two values: pitch and velocity. The pitch value indicates which key was depressed on the keyboard, and the velocity value indicates how hard the key was struck. Given the normalized velocity v of the last note played, where 0 < v ≤ 1, a very simple single input x_0 looks like this:

    x_0 = 0   if no note is depressed
    x_0 = v   otherwise

This input scheme can capture the timing and dynamics of the piece being played, but the pitch is ignored and it cannot take into account polyphonic playing, where the performer depresses more than one key at a time. If a musician merely wants parameter changes based on the timing and dynamics of monophonic playing, however, this input scheme might be a good choice.

Another option is to have one input for each key on the keyboard. For each key i on the keyboard, the input vector element x_i looks like this:

    x_i = 0    if key i is not depressed
    x_i = v_i  otherwise

This input scheme captures pitch, timing, and dynamics, and hypothetically provides enough information for a properly trained network to be able to extract any musical context that might be relevant. However, it remains to be seen if such training is feasible. Having one input value for each key on the keyboard might require significantly larger networks which introduce longer training times and additional processing latency.
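For concreteness, the two encodings just described might be computed as in the sketch below. It is an illustration under assumed conventions, not part of the system described here: an 88-key controller is assumed, and `held` is a hypothetical dictionary of currently depressed keys mapped to their normalized velocities.

    # Illustrative computation of the two input encodings described above.
    # Assumes `held` maps each currently depressed key (0-87) to its normalized
    # velocity, with insertion order reflecting playing order.
    import numpy as np

    NUM_KEYS = 88  # assuming a standard piano-style keyboard

    def single_input(held):
        """x_0: zero when no key is down, otherwise the velocity of the last note played."""
        return 0.0 if not held else list(held.values())[-1]

    def per_key_input(held):
        """x_i: one element per key; zero if the key is up, its velocity if down."""
        x = np.zeros(NUM_KEYS)
        for key, velocity in held.items():
            x[key] = velocity
        return x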
A third option is to explicitly provide the network with higher level musical features. Melodic information can be determined based on the intervals between consecutive notes. Expressive computer music performance systems (Widmer, Flossmann, and Grachten, 2009; Arcos, De Mántaras, and Serra, 1998) have had success using the Implication-Realization model (Narmour, 1990), which can be used to classify small local melodic structures. Such information can be determined from a monophonic performance, but becomes difficult if the performance is highly polyphonic.

Harmonic information can be explicitly determined if multiple notes are being played at once. The same harmonic intervals can mean different things depending on the key of the piece. Therefore, to use this information as input, the system must be trained for a specific key or the training data must be transposed to different keys during training to achieve generalization across keys.
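One simple way to realize the latter, sketched below purely as an illustration, is to transpose every note event in a training example by a random number of semitones before each epoch; the representation of a note as a (time, pitch, velocity) tuple is an assumption.

    # Illustrative key transposition of a training example (assumed note format).
    import random

    def transpose(notes, max_semitones=6):
        """notes: list of (time, pitch, velocity) tuples with MIDI pitch numbers."""
        shift = random.randint(-max_semitones, max_semitones)
        return [(t, min(127, max(0, pitch + shift)), vel) for t, pitch, vel in notes]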
Real Time Performance

Musicians using this system will be playing their keyboards, or whatever controllers they prefer. Every time a note is played, the system must predict the values of the synthesizer parameters and set them before the note event is sent to the synthesizer. This ensures that the parameters are set before the note is played so that the synthesizer creates the desired sound. Thus the system must be able to run fast enough that the total latency from the time at which the note is depressed to the time at which the note is played on the synthesizer is within the musician's acceptable latency tolerance.

In preliminary testing, relatively small LSTM networks have been able to complete a forward pass in approximately 50 microseconds on a mid-range laptop. For most musicians, added latency does not become an issue until it reaches several milliseconds. As this research progresses the networks are sure to grow in size, increasing the time required to complete a forward pass. Because forward pass time is the primary contributor to added latency, the networks cannot grow past a certain size before use will not be satisfactory to the musician. This limit will be continually evaluated as development of the system progresses.
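A rough way to check this budget is simply to time the per-step computation, as sketched below; the layer size, the stacked weight layout, and the resulting numbers are assumptions and will vary with the machine and the implementation.

    # Rough latency check for one recurrent step (illustrative sizes only).
    import time
    import numpy as np

    n_in, n_cells = 2, 32
    W = np.random.randn(4 * n_cells, n_in + n_cells)   # gates and cell input, stacked
    x, y, s = np.ones(n_in), np.zeros(n_cells), np.zeros(n_cells)

    start = time.perf_counter()
    for _ in range(10000):
        z = W @ np.concatenate([x, y])                  # one step's worth of matrix work
        i, f, o = (1.0 / (1.0 + np.exp(-z[:3 * n_cells]))).reshape(3, n_cells)
        g = np.tanh(z[3 * n_cells:])
        s = f * s + i * g                               # cell state update
        y = o * np.tanh(s)                              # cell output
    elapsed = time.perf_counter() - start
    print(f"average step: {elapsed / 10000 * 1e6:.1f} microseconds")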
Generalization

Any good machine learning system must be able to generalize well. Generalization becomes an issue here because it is impossible for human musicians to play something on an instrument exactly the same way twice, and often musicians will play the same musical phrase with slightly different tempo, dynamics, and articulation each time for variety. This system must be able to make satisfying parameter adjustments when musicians play phrases that are similar to the example phrases used to train the system, but not exactly the same.
Normally it is ideal to train a learning system with a large data set. Providing as many training examples as possible increases the generalizing power of the learned model. There are several possible ways to increase the number of training examples for this particular system. One is for the musician to play numerous examples of similar musical phrases and create parameter curves for each. However, this is a rather cumbersome task and the musician still might not be able to produce enough examples for satisfactory generalization.

Another way is for the system to alter the data during training in the same ways that a human might alter the playing of a musical phrase during performance. This involves changing the duration and velocity of notes, the tempo at which the notes are played, and even the notes themselves.

Altering the examples during training is similar to generating expressive computer music performances based on example human performances. The difference here is that these altered examples will never be heard, and thus do not need to sound like authentic human performances. There is generally a consistency to the way human performers make changes to the timing and dynamics of a musical phrase to achieve an expressive performance. For example, if a performer is playing an upward moving phrase, he or she might increase the tempo and dynamics as the phrase moves upwards. It could sound musically awkward to speed up and slow down multiple times during the same upward moving phrase, and as such is something that an expressive performance generating system would want to avoid. However, if one is only concerned with creating an altered training example for the sake of generalization it is not a concern if the example does not sound musically correct as a whole. It is only important that the individual changes within the phrase are consistent with what might happen note to note during a performance.
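A minimal sketch of this kind of alteration is shown below, assuming notes are stored as (onset, duration, pitch, velocity) tuples; the scaling ranges are arbitrary placeholders. Any recorded parameter change events would need to be rescaled with the same tempo factor so that inputs and targets stay aligned.

    # Illustrative training-time alteration of a recorded phrase (not the
    # system's actual augmentation scheme; ranges are placeholders).
    import random

    def altered_copy(notes, tempo_range=(0.9, 1.1), velocity_range=(0.9, 1.1)):
        """notes: list of (onset, duration, pitch, velocity) tuples, velocity in (0, 1]."""
        tempo = random.uniform(*tempo_range)           # one tempo factor per phrase
        altered = []
        for onset, duration, pitch, velocity in notes:
            v = min(1.0, velocity * random.uniform(*velocity_range))
            d = duration * random.uniform(0.95, 1.05)  # small per-note articulation change
            altered.append((onset * tempo, d, pitch, v))
        return altered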
Evaluation and Measuring Success

It is important to establish metrics to determine the level of success of this approach. Here there are both subjective and objective measures of quality to consider when evaluating performance.

To subjectively evaluate such a system, it needs to be put in the hands of a variety of musicians. Different musicians will have different ideas of how to use it, and different ways of determining if it lives up to their expectations. After collecting initial feedback and seeing how musicians use the system it will be easier to determine more specific subjective evaluation criteria.

As mentioned before, the system must be able to generate satisfactory output when presented with musical phrases that are similar to the phrases used to train the system. In some cases, what the system sees as similar and what the musician sees as similar might not agree, and what might be seen as a failure of the system might be due to training examples that do not properly capture the desired behavior.

Objective metrics are easier to define. As with any supervised learning task, the general goal is to minimize training error, and to do it in as few training epochs as possible. Different methods for capturing the musical context and different network topologies and parameters can be objectively compared based on the level of error minimization achieved during training and, in the case of similar results, the amount of time taken to train. These objective metrics can be compared with subjective evaluations of performance to ensure that optimizing the objective metrics correlates with improved subjective evaluation.

Preliminary Results

Much of the work so far has been on developing a custom LSTM implementation and conducting training experiments to ensure that LSTM is a suitable learning algorithm for this problem. Two simple experiments are presented here. The first experiment demonstrates that the LSTM implementation can learn to output a basic triangle-shaped temporal parameter change on demand. The second experiment shows that LSTM is capable of learning to detect higher level articulation patterns from an input stream and output different values based on the temporal position within a pattern.

Simple Temporal Modulation Experiment

This experiment was devised to determine how well an LSTM network can learn to output a simple triangle-shaped parameter modulation whenever a key is pressed. This shape is shown in Figure 3.

Figure 3: A simple triangle-shaped parameter modulation over 15 time steps. The network was able to learn to output this shape perfectly after 27,370 training epochs. It is merely an approximation of a triangle due to the 16 discrete output levels.

Rather than outputting a single continuous parameter value, the network outputs a vector y which contains 16 values representing discrete parameter levels. The predicted parameter level is selected by finding the maximum element in y. Each element of each target vector is set to 0 except for the element representing the desired parameter level, which is set to 1. Training follows the backpropagation algorithm described in Gers, Schmidhuber, and Cummins (2000).

Because the output is discrete rather than continuous, the output pattern is merely an approximation of a triangle shape.

A single network input x_0 represents the current state of the controller at the current time step, and has two possible values:
    x_0 = 0   if no note is depressed
    x_0 = 1   otherwise

Each training sequence consists of 25 subsequences, distributed as follows:

• One subsequence containing the triangle-shaped modulation. Input for each time step is 1. The output starts at level 1, rises to level 16, and falls back to 1 over 15 time steps. This results in output that is a discrete approximation of the triangle shape.

• Ten subsequences representing silence. Input for all time steps is 0 and output for all time steps is level 1. The lengths of the silent sequences range from 9 to 11 time steps. Prior to each training epoch and each validation pass, these subsequences are shuffled so that the non-silent section is in a different position within the sequence for each pass.
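A sketch of how such a training sequence might be generated, together with the one-hot targets and the argmax decoding described above, is given below. The exact per-step levels of the triangle are not specified here, so the quantized ramp is only an approximation, and all names are illustrative.

    # Illustrative generation of one training sequence of (input, target) pairs.
    # The exact triangle quantization is an assumption.
    import random
    import numpy as np

    LEVELS = 16

    def one_hot(level):
        """Target vector: zeros except the element for the desired level (1..16)."""
        t = np.zeros(LEVELS)
        t[level - 1] = 1.0
        return t

    def triangle_levels(length=15):
        """Discrete approximation of a rise from level 1 to 16 and back over `length` steps."""
        mid = (length - 1) / 2
        return [1 + round((LEVELS - 1) * (1 - abs(t - mid) / mid)) for t in range(length)]

    def training_sequence():
        triangle = [(1.0, one_hot(level)) for level in triangle_levels()]
        silences = [[(0.0, one_hot(1))] * random.randint(9, 11) for _ in range(10)]
        subsequences = [triangle] + silences
        random.shuffle(subsequences)          # new ordering for every pass
        return [step for sub in subsequences for step in sub]

    def predicted_level(y):
        """Decode a network output vector back to a parameter level."""
        return int(np.argmax(y)) + 1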
The topology of the LSTM hidden layer consists of seven memory cells, each with input, output, and forget gates. All units in the hidden layer receive a weighted connection from the input, as well as recurrent connections from all other units in the hidden layer. The memory cells are fully connected to the output units. Gates and output units utilize a sigmoid activation function.

Networks with this topology are able to learn to output the desired shape perfectly over the time steps during which the input is equal to 1.

Articulation Experiment

For this experiment, training data is generated to simulate notes being played with different articulation. Notes with two types of articulation are generated: staccato notes, which here last for a duration of 8 to 12 time steps followed by 8 to 12 time steps of silence, and normal notes, which last for a duration of 16 to 20 time steps followed by 1 to 4 time steps of silence.

At each time step the network is to output one of three parameter levels as follows: Level 1 is the normal level, which is set while normal notes are being played and during extended periods of silence. If a staccato note is played, the parameter level should be increased to level 2 at the onset of the next note. If the next note is also staccato, then the level should be increased to level 3 at the onset of the note after that. The parameter level will stay at level 3 as long as staccato notes continue to be played. If a normal note is played after a series of 3 or more staccato notes, the level should be decreased to level 2 after the note has been sustained for 14 time steps, and then decreased to level 1 after 3 more time steps. This behavior is illustrated in Figure 4.

Figure 4: Illustration of the articulation experiment. Shown are 5 consecutive staccato notes followed by one normal note. This subsequence starts at time step 500 of a validation run after 90,010 training epochs. The parameter level stays at 1 at the onset of the first note, rises to 2 at the onset of the second note, and rises to 3 at the onset of the third. The parameter level remains at 3 until most of the way through the normal note, at which point it falls back down to 2 and then 1.

It is worth noting that it would be impossible for a network to learn to raise the parameter level on the onset of the first staccato note in a series of staccato notes, because at that point it is impossible to determine for how long the note will be sustained. This is a limitation of operating in real time, and must be kept in mind when creating future training examples.

Each training sequence consists of 25 subsequences, distributed as follows:

• Ten subsequences of silence, each with a random duration of 1 to 50 time steps. Output is always level 1.

• Ten subsequences each consisting of a single normal note with a random duration of 16 to 20 time steps followed by 1 to 4 time steps of silence. Output is always level 1.

• Five subsequences of 3 to 5 staccato notes followed by one normal note. Output starts at level 1, increases to level 2 at the onset of the second note, increases to level 3 at the onset of the third note, decreases to level 2 fifteen time steps into the normal note, and then decreases back to level 1 three time steps later.

Prior to each training epoch and each validation pass, these subsequences are shuffled to create a unique sequence which still contains the desired properties.

As in the previous experiment, the state of the controller is passed to the network as a single input x_0 which is 0 or 1 depending on whether or not a key is depressed. The output vector again represents discrete parameter levels, but in this case only 3 levels are used.

The input vector at time step t also contains the parameter level from time step t − 1. During training the target vector from time step t − 1 is used. During validation all elements of the output vector from time step t − 1 are set to 0 except for the maximum element, which is set to 1. Feeding the output from t − 1 into the network along with the current controller state improves training accuracy dramatically.
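The sketch below illustrates how that feedback input might be assembled, with the previous target fed back during training and the previous output collapsed to a one-hot vector during validation; the function name and its argument conventions are assumptions for illustration only.

    # Illustrative construction of the input vector with the fed-back
    # parameter level from time step t-1 (names and conventions assumed).
    import numpy as np

    LEVELS = 3

    def level_one_hot(level):
        v = np.zeros(LEVELS)
        v[level - 1] = 1.0
        return v

    def build_input(key_down, prev_target=None, prev_output=None):
        """Controller state plus the previous time step's parameter level."""
        x0 = 1.0 if key_down else 0.0
        if prev_target is not None:
            feedback = prev_target                                      # training: use the target
        else:
            feedback = level_one_hot(int(np.argmax(prev_output)) + 1)   # validation
        return np.concatenate([[x0], feedback])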
Perfect output was achieved in this experiment as well, using the same network topology and parameters as in the previous experiment.

Future Work
It has been shown that LSTM networks are capable of learning to output simple temporal changes in the presence of a stimulating input, and that LSTM networks can be trained to recognize articulation patterns and to output parameter levels based on provided examples. Future experimentation will focus on training networks to achieve tasks that depend on pitch and dynamics as well as articulation.

Once satisfactory performance has been established using generated data, the system will be tested by various musicians using their own training examples, which will further expose the strengths and limitations of the system.

Conclusions

Employing a computer system to automatically control sound synthesizer parameters during human performance is an unexplored problem that warrants continued investigation. Results from initial experimentation suggest that LSTM networks have great potential for use in solving this problem. Successful application here will hopefully aid others in applying LSTM to other problems that involve continuous real time sequences.
Adopting a machine learning approach to this problem allows for parameter control that is consistent with performers' desired aesthetics, and allows such a system to be used by musicians who do not possess computer programming skills. Machine learning applications usually learn from large aggregations of data sampled from many individuals. This project puts the teaching power directly in the hands of individuals to allow them to fully realize their visions.

                       References
Arcos, J. L.; De Mántaras, R. L.; and Serra, X. 1998. Saxex: A case-based reasoning system for generating expressive musical performances. Journal of New Music Research 27(3):194–210.

Assayag, G.; Bloch, G.; Chemillier, M.; Cont, A.; and Dubnov, S. 2006. OMax brothers: A dynamic topology of agents for improvization learning. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, AMCMM '06, 125–132. New York, NY, USA: ACM.

Eck, D., and Lapalme, J. 2008. Learning musical structure directly from sequences of music. University of Montreal, Department of Computer Science, CP 6128.

Eck, D., and Schmidhuber, J. 2002. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, 747–756. IEEE.

Elman, J. L. 1990. Finding structure in time. Cognitive Science 14:179–211.

Gers, F. A.; Schmidhuber, J. A.; and Cummins, F. A. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12(10):2451–2471.

Gers, F. A.; Schraudolph, N. N.; and Schmidhuber, J. 2003. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research 3:115–143.

Graves, A. 2013. Generating sequences with recurrent neural networks. CoRR abs/1308.0850.

Jordan, M. I. 1986. Serial order: A parallel distributed processing approach. Technical Report ICS Report 8604, Institute for Cognitive Science, University of California, San Diego.

Kirke, A., and Miranda, E. R. 2009. A survey of computer systems for expressive music performance. ACM Computing Surveys 42(1):3:1–3:41.

Narmour, E. 1990. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press.

Raphael, C. 2010. Music plus one and machine learning. In Proceedings of the 27th International Conference on Machine Learning, ICML 10.

Widmer, G.; Flossmann, S.; and Grachten, M. 2009. YQX plays Chopin. AI Magazine 30(3):35.

Widmer, G. 2001. Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence 146:129–148.