                           Towards a Machine Learning Based Control
                      of Musical Synthesizers in Real-Time Live Performance
                                            Nathan Sommer and Anca Ralescu
                                                      EECS Department
                                               University of Cincinnati, ML0030
                                               Cincinnati, OH 45221-0030, USA
                                           sommernw@mail.uc.edu, anca.ralescu@uc.edu


                           Abstract
Musicians who play synthesizers often adjust synthesis parameters during live performance to achieve a more expressive sound. Training a computer to make automatic parameter adjustments based on examples provided by the performer frees the performer from this responsibility while maintaining an expressive sound in line with the performer's desired aesthetic. This paper is an overview of ongoing research to explore the effectiveness of using Long Short-Term Memory (LSTM) recurrent neural networks to accomplish this task.


                       Introduction
Electronic sound synthesizers have been used as musical instruments for more than a century, and musicians and researchers continue to explore new ways to synthesize interesting and expressive sounds. Approaches to allow humans to control synthesizers include woodwind style controllers, guitar and other stringed instrument style controllers, and controllers that map gestures to musical events.
However, the most popular synthesizer controller continues to be the piano-style keyboard.

The keyboard remains an attractive controller because it is familiar to many musicians, and because it is a natural way to tell a synthesizer to start or stop playing a sound. When a key on a keyboard is depressed, a message is sent to start playing a sound at a certain frequency. When the key is released, a message is sent to tell the synthesizer to stop.

What's missing from this is a way to control the quality of the sound once a key has been depressed. Wind instruments allow the musician to alter the quality of the sound through breath and mouth control, and bowed string instruments allow for different sounds through different bowing techniques. To allow for similar expressive sound adjustments, most synthesizers have a number of parameters that can be adjusted via knobs, sliders, wheels, pedals, and other methods. This allows for a great deal of sound control, but the number of parameters that can be controlled simultaneously is limited by the number of hands and feet the performer has, and often the performer would like to use both hands simultaneously to play the keyboard.

One way to allow for a more expressive sound during performance without requiring the human performer to directly control synthesis parameters is to use a computer to control the parameters. Many modern synthesizers are hardware or software modules that are not directly connected to a keyboard. Multiple protocols exist to control these synthesizers, such as MIDI and OSC. Keyboards, other types of controllers, and computers can send messages to these synthesizers telling them to start or stop playing a sound, or to change a synthesis parameter value. With such a setup, a computer program can be used to monitor messages from a keyboard controller and alter the synthesis parameters in real time based on what the human is playing. This paper proposes a synthesizer control system, the Middleman, which implements such a setup, and is illustrated in Figure 1.

Figure 1: Basic functionality of the proposed system. Note events are received by the system and allow it to continually update the current musical context, which is continually fed through the LSTM network. Note events are passed through to the synthesizer along with generated synthesizer parameter change events.
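The sketch below illustrates the kind of event loop implied by Figure 1: note messages are monitored, synthesis parameters are updated, and the original events are passed through to the synthesizer. It is only a minimal sketch, not the authors' implementation; it assumes the python-mido MIDI library, the port names are placeholders, and predict_parameter_levels() is a hypothetical stand-in for the trained network.

    # Minimal sketch of a Middleman-style event loop (illustrative only).
    # Assumes the python-mido library; port names and the predictor are placeholders.
    import mido

    def predict_parameter_levels(context):
        # Hypothetical stand-in for the trained network: map the musical
        # context to MIDI control change values (0-127).
        return {1: 64}

    context = []  # running list of note events, i.e. the current musical context

    with mido.open_input('Keyboard') as keys, mido.open_output('Synth') as synth:
        for msg in keys:
            if msg.type in ('note_on', 'note_off'):
                context.append(msg)
                # Set parameters before forwarding the note, so the synthesizer
                # is already configured when the note begins to sound.
                for cc, value in predict_parameter_levels(context).items():
                    synth.send(mido.Message('control_change', control=cc, value=value))
            synth.send(msg)  # pass the original event through to the synthesizer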
Because different musicians have different desired aesthetics, there is no one correct way to shape synthesis parameters over time during a performance. Ideally a performer would be able to teach a machine to learn to control the parameters in a way that is consistent with the performer's desired aesthetic.
This paper explores the extent to which machine learning techniques, specifically a class of neural networks, can be used to achieve such control in a manner comparable to that of a human musician.

Expressive Music Performance

Expressive music performance has been of particular interest to researchers in the last decade, and much work has been done to attempt to model expressive human performance with machines (Kirke and Miranda, 2009). These models can be used to generate expressive musical performances by machine alone, and also for collaborative performances between human and machine.

When human musicians perform a piece of music from a written score, they inject their own expression and aesthetic into the music by varying the following musical parameters, or performance actions (Kirke and Miranda, 2009):

• tempo, the speed of the music
• dynamics, how loud the notes are played
• articulation, the transitions between notes
• intonation, pitch accuracy
• timbre, the quality of the sound

A musician may only be able to vary a subset of these actions depending on the instrument played. For example, a saxophone player can vary the timbre and intonation of a note as it is being played by altering the tightness of the mouth on the mouthpiece of the instrument, but a piano player cannot achieve this at all.

Traditionally, music composers and arrangers provide information about how pieces of music are intended to be performed through scores. Scores contain notation that tells musicians how loud to play, when to speed up and slow down, what articulation to use for what notes, etc. The musician ultimately decides how to interpret this information, and adds expressive subtlety to a piece of music during performance that cannot be conveyed in a score alone.

Expressive Computer Music Performance

Similarly to a human performer, a computer can perform a piece of music on an electronic instrument as it is written in a score. However, it is difficult for a computer to perform a piece of music with the same expressive subtlety as a human performer. Computers are good at following rules, but the rules for musical expression are difficult to define. Different styles of music have different rules of expression, and often what makes a particular performance of a piece of music interesting is how the musician plays with the listener's expectations of such rules.

One way researchers have tried to get computers to perform music in an expressive way is by learning expression rules from human performances (Widmer, 2001). Performance examples are used which include a musical score along with a recorded human performance, and algorithms are employed to explicitly or implicitly extract performance rules that a computer can use to perform unseen scores.

One of the easiest instruments with which to accomplish this task is the piano. Unlike many wind and string instruments, the piano does not allow for subtle control of dynamics, timbre, and intonation once a note has been struck. Because the piano can only play discrete notes there is no control over intonation, and the timbre can only be controlled in the same manner as the dynamics, in how hard the key is struck.

Due to this, a piano performance can easily be described with key and pedal events rather than recorded audio, and mechanical pianos can be controlled by a computer. There is an annual piano performance rendering contest called Rencon (http://renconmusic.org/) which evaluates performance rendering systems' abilities at performing unseen musical scores on the piano and continues to push progress in this area (Widmer, Flossmann, and Grachten, 2009). An interesting aspect of this competition is that while the contestants are computer systems, the judges are human and therefore the evaluations are highly subjective.

Human Computer Collaborative Performance

Other research explores ways in which humans and computers can collaboratively contribute to expressive performances. The OMax system (Assayag et al., 2006) allows for improvisational collaboration in real time by listening to the human musician, learning features of the musician's style, and playing along interactively. Music Plus One (Raphael, 2010) is an automatic accompaniment system which plays a scored piece along with a soloist, following the soloist's tempo. In addition to the piano performance rendering contest, Rencon also has a "semi-automatic and interactive" category, which in 2013 was won by VirtualPhilharmony, a system that allows a human to conduct a virtual orchestra.

All of these examples are exciting works that showcase the extent to which humans and computers can work together to make music, but none of the collaborative systems developed so far address the problem put forth in this paper: allowing a human performer to control the pitch, tempo, dynamics, and articulation of a performance, while a computer controls timbre and intonation by varying sound synthesis parameters.

We hypothesize that a machine learning approach to this problem can be successful. Allowing musicians to train the system with example performances created by the musicians themselves will result in performances that are unique and adhere to the performers' visions. This problem also presents unique challenges from a machine learning perspective, which will be discussed in later sections.

Creating Performance Examples

Often when creating recordings in a studio, synthesizer parts are initially recorded as events rather than as audio. This allows musicians to separate the recording of note events from the recording of synthesizer parameter change events. After a keyboard performance has been recorded, parameter changes can be recorded to achieve the desired sound over the course of the recording. In this way, musicians can create interesting and expressive synthesizer recordings that could not be performed live by a single musician.
These studio event recordings can be used as training examples for our system. If the system can learn to reproduce the desired parameter changes while a human is performing, temporally changing sounds that were previously only attainable in a studio can be brought to life during live performances. This method of creating training examples is natural because musicians are already accustomed to recording this way, and allows them to use the synthesizers they are already familiar with.
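To make the idea concrete, one possible representation of such a training example is sketched below: note events and parameter change events kept as separate streams that share a common time axis, mirroring how they are recorded separately in the studio. The field names, units, and types are assumptions for illustration, not the system's actual format.

    # One possible representation of a training example (illustrative only):
    # separately recorded note events and parameter change events that share
    # a common time axis.
    from dataclasses import dataclass, field

    @dataclass
    class NoteEvent:
        time: float      # seconds from the start of the recording
        pitch: int       # MIDI note number, 0-127
        velocity: int    # 0 for note off, 1-127 for note on

    @dataclass
    class ParameterEvent:
        time: float
        parameter: str   # e.g. "filter_cutoff"
        value: float     # normalized 0.0-1.0

    @dataclass
    class TrainingExample:
        notes: list = field(default_factory=list)        # NoteEvent objects
        parameters: list = field(default_factory=list)   # ParameterEvent objects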

         Learning with Time Series Data
Many tasks which have been tackled with machine learning approaches involve time series data. Some of these tasks, such as speech recognition, involve finding patterns in time series data. Other tasks, such as numerous types of forecasting, involve predicting what will happen in the future based on what has happened in the past.

For this particular problem we are concerned with predicting the immediate future values of synthesizer parameters. These values must be predicted based on two things:

• past parameter values
• the current musical context of the piece being played.

The important aspects of the musical context that affect the parameter levels are defined implicitly by a musician through training examples, and so the learning system employed must be able to discover those aspects. The current musical context at any time step during a musical performance is dependent on events that have happened at previous time steps, and so the system must have an awareness of the past. Which past events are important and how long they remain important will differ for each set of training examples, and so the system must be flexible in that regard. The following sections discuss some techniques that have been used to achieve this goal in other problem domains.

Recurrent Neural Networks

Artificial neural networks have long been useful tools in machine learning due to their ability to approximate non-linear functions. While the standard artificial neural network can be useful for some time series data learning tasks, there are limitations to the model when applied to time series data.

After a standard neural network is trained, data is input via input nodes. In each subsequent layer, node activations are calculated by applying activation functions to weighted sums of the previous layer's activation values. No information about previous activation values is stored, and so the network is in the same state before each forward pass. Therefore, in order for such a network to accept time series data as input it must receive a window of data, or all the data after time step t − n up to time step t, where t is the latest time step under consideration and n is the size of the window.

This sliding window approach has limitations because events relevant to time step t could have occurred at or before time step t − n, yet they will not be taken into consideration by the network because they are outside of the window. An alternative to the window approach is to use a Recurrent Neural Network (RNN) (Jordan, 1986; Elman, 1990). RNNs contain nodes which retain activation values from the previous time step, and contain recurrent connections from those nodes to other nodes in the network. In this manner, data can be fed into the network one time step per forward pass, and the network will learn to take into account information from past time steps when calculating output for the current time step.

Simple RNNs as described above are generally trained using Backpropagation Through Time (BPTT). Using this method, errors at the current time step flow backwards through previous time steps in order to calculate changes in the network's weights. However, these errors either vanish or blow up as they travel backwards in time. As a result, simple RNNs cannot learn well when relevant events happen more than 5-10 time steps in the past (Gers, Schmidhuber, and Cummins, 2000).

LSTM Networks

Long Short-Term Memory (LSTM) RNNs overcome this limitation (Gers, Schmidhuber, and Cummins, 2000). LSTM networks contain memory cells which retain values between forward passes through the network. The networks also contain three types of specialized activation units called gates. A group of memory cells and their associated gates are organized into blocks. Input gates control write access to the memory cells by scaling values that are to be added to the memory cells; output gates control read access from the memory cells by scaling the values that are output by the cells; forget gates allow the memory cell to periodically reset by scaling the cell's current value. The gates have weighted connections from the input as well as recurrent connections from the other gates and memory cells. During training the gates learn to open and close so that the memory cells can store and accumulate values, keep them for arbitrary periods of time, and use the cells' values to affect the output as needed.

Figure 2: LSTM block with a single memory cell, taken from Gers, Schraudolph, and Schmidhuber (2003). The gates and cell input activations are calculated by passing the weighted sum of incoming connections through an activation function, as with a standard artificial neural network node. Input to the cell is scaled by the input gate's activation, output from the cell is scaled by the output gate's activation, and the cell's state is scaled by the forget gate's activation.
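The following sketch summarizes one forward step through such a block of memory cells, in the spirit of the formulation in Gers, Schmidhuber, and Cummins (2000). It is illustrative rather than the implementation used here: the weight layout and shapes are assumptions, and variations such as peephole connections are omitted.

    # Illustrative forward step for a layer of LSTM memory cells (not the
    # authors' implementation; weight layout and shapes are assumptions).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, y_prev, s_prev, W):
        """x: input vector; y_prev: previous cell outputs; s_prev: previous cell
        states; W: dict of weight matrices for the three gates and the cell input."""
        z = np.concatenate([x, y_prev])      # input plus recurrent connections
        i = sigmoid(W['in'] @ z)             # input gate: scales what is written
        f = sigmoid(W['forget'] @ z)         # forget gate: scales the old state
        o = sigmoid(W['out'] @ z)            # output gate: scales what is read
        g = np.tanh(W['cell'] @ z)           # squashed cell input
        s = f * s_prev + i * g               # memory cell state update
        y = o * np.tanh(s)                   # cell output passed on to the output layer
        return y, s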
LSTM networks have been used successfully in both recognizing patterns from sequences and in generating sequences. Most notably they have proven highly effective in handwriting recognition and generation (Graves, 2013). There have been musical applications of LSTM: LSTM networks have been used to generate musical chord progressions (Eck and Schmidhuber, 2002), and for learning musical structure from scores (Eck and Lapalme, 2008), but these applications operate on the level of notes and chords and do not operate in real time. Our project explores how effectively LSTM networks can be trained to implicitly extract relevant high level musical structures from low level time series input, and use that information to control sound synthesis parameters.
Specific Challenges

In order for this system to be successful, several challenges must be overcome. The learning system is to be given the musical context at each time step, and so the most effective manner of encoding the current musical context must be determined. The system must run in real time, thus care must be taken to ensure it can continually predict parameter levels quickly enough. Generalization is an issue with any machine learning task, and here one must be careful not to overfit to the provided training examples so that the system can generalize well. Finally, suitable metrics must be devised to evaluate the success of the system.

Capturing the Musical Context

When this system is used, it will be given a regularly updated state of the current musical context. This information must be determined from the stream of note on and note off events received from the keyboard and should provide enough context to the learning system so that it can effectively predict the synthesizer parameter levels.

One very simple way to capture the context is to have a single input value which represents whether or not a note is currently being played. When a note on event is sent from a controller, it has two values: pitch and velocity. The pitch value indicates which key was depressed on the keyboard, and the velocity value indicates how hard the key was struck. Given the normalized velocity v of the last note played, where 0 < v ≤ 1, a very simple single input x_0 looks like this:

    x_0 = 0   if no note is depressed
    x_0 = v   otherwise

This input scheme can capture the timing and dynamics of the piece being played, but the pitch is ignored and it cannot take into account polyphonic playing, where the performer depresses more than one key at a time. If a musician merely wants parameter changes based on the timing and dynamics of monophonic playing, however, this input scheme might be a good choice.

Another option is to have one input for each key on the keyboard. For each key i on the keyboard, the input vector element x_i looks like this:

    x_i = 0    if key i is not depressed
    x_i = v_i  otherwise

This input scheme captures pitch, timing, and dynamics, and hypothetically provides enough information for a properly trained network to be able to extract any musical context that might be relevant. However, it remains to be seen if such training is feasible. Having one input value for each key on the keyboard might require significantly larger networks which introduce longer training times and additional processing latency.
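For concreteness, the two encodings just described might be computed as in the sketch below. It is an illustration under assumed conventions, not part of the system described here: an 88-key controller is assumed, and `held` is a hypothetical dictionary of currently depressed keys mapped to their normalized velocities.

    # Illustrative computation of the two input encodings described above.
    # Assumes `held` maps each currently depressed key (0-87) to its normalized
    # velocity, with insertion order reflecting playing order.
    import numpy as np

    NUM_KEYS = 88  # assuming a standard piano-style keyboard

    def single_input(held):
        """x_0: zero when no key is down, otherwise the velocity of the last note played."""
        return 0.0 if not held else list(held.values())[-1]

    def per_key_input(held):
        """x_i: one element per key; zero if the key is up, its velocity if down."""
        x = np.zeros(NUM_KEYS)
        for key, velocity in held.items():
            x[key] = velocity
        return x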
A third option is to explicitly provide the network with higher level musical features. Melodic information can be determined based on the intervals between consecutive notes. Expressive computer music performance systems (Widmer, Flossmann, and Grachten, 2009; Arcos, De Mántaras, and Serra, 1998) have had success using the Implication-Realization model (Narmour, 1990), which can be used to classify small local melodic structures. Such information can be determined from a monophonic performance, but becomes difficult if the performance is highly polyphonic.

Harmonic information can be explicitly determined if multiple notes are being played at once. The same harmonic intervals can mean different things depending on the key of the piece. Therefore, to use this information as input, the system must be trained for a specific key or the training data must be transposed to different keys during training to achieve generalization across keys.
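One simple way to realize the latter, sketched below purely as an illustration, is to transpose every note event in a training example by a random number of semitones before each epoch; the representation of a note as a (time, pitch, velocity) tuple is an assumption.

    # Illustrative key transposition of a training example (assumed note format).
    import random

    def transpose(notes, max_semitones=6):
        """notes: list of (time, pitch, velocity) tuples with MIDI pitch numbers."""
        shift = random.randint(-max_semitones, max_semitones)
        return [(t, min(127, max(0, pitch + shift)), vel) for t, pitch, vel in notes]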
Real Time Performance

Musicians using this system will be playing their keyboards, or whatever controllers they prefer. Every time a note is played, the system must predict the values of the synthesizer parameters and set them before the note event is sent to the synthesizer. This ensures that the parameters are set before the note is played so that the synthesizer creates the desired sound. Thus the system must be able to run fast enough that the total latency from the time at which the note is depressed to the time at which the note is played on the synthesizer is within the musician's acceptable latency tolerance.

In preliminary testing, relatively small LSTM networks have been able to complete a forward pass in approximately 50 microseconds on a mid-range laptop. For most musicians, added latency does not become an issue until it reaches several milliseconds. As this research progresses the networks are sure to grow in size, increasing the time required to complete a forward pass. Because forward pass time is the primary contributor to added latency, the networks cannot grow past a certain size before use will not be satisfactory to the musician. This limit will be continually evaluated as development of the system progresses.
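A rough way to check this budget is simply to time the per-step computation, as sketched below; the layer size, the stacked weight layout, and the resulting numbers are assumptions and will vary with the machine and the implementation.

    # Rough latency check for one recurrent step (illustrative sizes only).
    import time
    import numpy as np

    n_in, n_cells = 2, 32
    W = np.random.randn(4 * n_cells, n_in + n_cells)   # gates and cell input, stacked
    x, y, s = np.ones(n_in), np.zeros(n_cells), np.zeros(n_cells)

    start = time.perf_counter()
    for _ in range(10000):
        z = W @ np.concatenate([x, y])                  # one step's worth of matrix work
        i, f, o = (1.0 / (1.0 + np.exp(-z[:3 * n_cells]))).reshape(3, n_cells)
        g = np.tanh(z[3 * n_cells:])
        s = f * s + i * g                               # cell state update
        y = o * np.tanh(s)                              # cell output
    elapsed = time.perf_counter() - start
    print(f"average step: {elapsed / 10000 * 1e6:.1f} microseconds")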
Generalization

Any good machine learning system must be able to generalize well. Generalization becomes an issue here because it is impossible for human musicians to play something on an instrument exactly the same way twice, and often musicians will play the same musical phrase with slightly different tempo, dynamics, and articulation each time for variety. This system must be able to make satisfying parameter adjustments when musicians play phrases that are similar to the example phrases used to train the system, but not exactly the same.
Normally it is ideal to train a learning system with a large data set. Providing as many training examples as possible increases the generalizing power of the learned model. There are several possible ways to increase the number of training examples for this particular system. One is for the musician to play numerous examples of similar musical phrases and create parameter curves for each. However, this is a rather cumbersome task and the musician still might not be able to produce enough examples for satisfactory generalization.

Another way is for the system to alter the data during training in the same ways that a human might alter the playing of a musical phrase during performance. This involves changing the duration and velocity of notes, the tempo at which the notes are played, and even the notes themselves.

Altering the examples during training is similar to generating expressive computer music performances based on example human performances. The difference here is that these altered examples will never be heard, and thus do not need to sound like authentic human performances. There is generally a consistency to the way human performers make changes to the timing and dynamics of a musical phrase to achieve an expressive performance. For example, if a performer is playing an upward moving phrase, he or she might increase the tempo and dynamics as the phrase moves upwards. It could sound musically awkward to speed up and slow down multiple times during the same upward moving phrase, and as such is something that an expressive performance generating system would want to avoid. However, if one is only concerned with creating an altered training example for the sake of generalization it is not a concern if the example does not sound musically correct as a whole. It is only important that the individual changes within the phrase are consistent with what might happen note to note during a performance.
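A minimal sketch of this kind of alteration is shown below, assuming notes are stored as (onset, duration, pitch, velocity) tuples; the scaling ranges are arbitrary placeholders. Any recorded parameter change events would need to be rescaled with the same tempo factor so that inputs and targets stay aligned.

    # Illustrative training-time alteration of a recorded phrase (not the
    # system's actual augmentation scheme; ranges are placeholders).
    import random

    def altered_copy(notes, tempo_range=(0.9, 1.1), velocity_range=(0.9, 1.1)):
        """notes: list of (onset, duration, pitch, velocity) tuples, velocity in (0, 1]."""
        tempo = random.uniform(*tempo_range)           # one tempo factor per phrase
        altered = []
        for onset, duration, pitch, velocity in notes:
            v = min(1.0, velocity * random.uniform(*velocity_range))
            d = duration * random.uniform(0.95, 1.05)  # small per-note articulation change
            altered.append((onset * tempo, d, pitch, v))
        return altered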
Evaluation and Measuring Success

It is important to establish metrics to determine the level of success of this approach. Here there are both subjective and objective measures of quality to consider when evaluating performance.

To subjectively evaluate such a system, it needs to be put in the hands of a variety of musicians. Different musicians will have different ideas of how to use it, and different ways of determining if it lives up to their expectations. After collecting initial feedback and seeing how musicians use the system it will be easier to determine more specific subjective evaluation criteria.

As mentioned before, the system must be able to generate satisfactory output when presented with musical phrases that are similar to the phrases used to train the system. In some cases, what the system sees as similar and what the musician sees as similar might not agree, and what might be seen as a failure of the system might be due to training examples that do not properly capture the desired behavior.

Objective metrics are easier to define. As with any supervised learning task, the general goal is to minimize training error, and to do it in as few training epochs as possible. Different methods for capturing the musical context and different network topologies and parameters can be objectively compared based on the level of error minimization achieved during training and, in the case of similar results, the amount of time taken to train. These objective metrics can be compared with subjective evaluations of performance to ensure that optimizing the objective metrics correlates with improved subjective evaluation.

Preliminary Results

Much of the work so far has been on developing a custom LSTM implementation and conducting training experiments to ensure that LSTM is a suitable learning algorithm for this problem. Two simple experiments are presented here. The first experiment demonstrates that the LSTM implementation can learn to output a basic triangle-shaped temporal parameter change on demand. The second experiment shows that LSTM is capable of learning to detect higher level articulation patterns from an input stream and output different values based on the temporal position within a pattern.

Simple Temporal Modulation Experiment

This experiment was devised to determine how well an LSTM network can learn to output a simple triangle-shaped parameter modulation whenever a key is pressed. This shape is shown in Figure 3.

Figure 3: A simple triangle-shaped parameter modulation over 15 time steps. The network was able to learn to output this shape perfectly after 27,370 training epochs. It is merely an approximation of a triangle due to the 16 discrete output levels.

Rather than outputting a single continuous parameter value, the network outputs a vector y which contains 16 values representing discrete parameter levels. The predicted parameter level is selected by finding the maximum element in y. Each element of each target vector is set to 0 except for the element representing the desired parameter level, which is set to 1. Training follows the backpropagation algorithm described in Gers, Schmidhuber, and Cummins (2000).

Because the output is discrete rather than continuous, the output pattern is merely an approximation of a triangle shape.

A single network input x_0 represents the current state of the controller at the current time step, and has two possible values:
    x_0 = 0   if no note is depressed
    x_0 = 1   otherwise

Each training sequence consists of 25 subsequences, distributed as follows:

• One subsequence containing the triangle-shaped modulation. Input for each time step is 1. The output starts at level 1, rises to level 16, and falls back to 1 over 15 time steps. This results in output that is a discrete approximation of the triangle shape.

• Ten subsequences representing silence. Input for all time steps is 0 and output for all time steps is level 1. The lengths of the silent sequences range from 9 to 11 time steps. Prior to each training epoch and each validation pass, these subsequences are shuffled so that the non-silent section is in a different position within the sequence for each pass.
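A sketch of how such a training sequence might be generated, together with the one-hot targets and the argmax decoding described above, is given below. The exact per-step levels of the triangle are not specified here, so the quantized ramp is only an approximation, and all names are illustrative.

    # Illustrative generation of one training sequence of (input, target) pairs.
    # The exact triangle quantization is an assumption.
    import random
    import numpy as np

    LEVELS = 16

    def one_hot(level):
        """Target vector: zeros except the element for the desired level (1..16)."""
        t = np.zeros(LEVELS)
        t[level - 1] = 1.0
        return t

    def triangle_levels(length=15):
        """Discrete approximation of a rise from level 1 to 16 and back over `length` steps."""
        mid = (length - 1) / 2
        return [1 + round((LEVELS - 1) * (1 - abs(t - mid) / mid)) for t in range(length)]

    def training_sequence():
        triangle = [(1.0, one_hot(level)) for level in triangle_levels()]
        silences = [[(0.0, one_hot(1))] * random.randint(9, 11) for _ in range(10)]
        subsequences = [triangle] + silences
        random.shuffle(subsequences)          # new ordering for every pass
        return [step for sub in subsequences for step in sub]

    def predicted_level(y):
        """Decode a network output vector back to a parameter level."""
        return int(np.argmax(y)) + 1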
The topology of the LSTM hidden layer consists of seven memory cells, each with input, output, and forget gates. All units in the hidden layer receive a weighted connection from the input, as well as recurrent connections from all other units in the hidden layer. The memory cells are fully connected to the output units. Gates and output units utilize a sigmoid activation function.

Networks with this topology are able to learn to output the desired shape perfectly over the time steps during which the input is equal to 1.

Articulation Experiment

For this experiment, training data is generated to simulate notes being played with different articulation. Notes with two types of articulation are generated: staccato notes, which here last for a duration of 8 to 12 time steps followed by 8 to 12 time steps of silence, and normal notes, which last for a duration of 16 to 20 time steps followed by 1 to 4 time steps of silence.

At each time step the network is to output one of three parameter levels as follows: Level 1 is the normal level, which is set while normal notes are being played and during extended periods of silence. If a staccato note is played, the parameter level should be increased to level 2 at the onset of the next note. If the next note is also staccato, then the level should be increased to level 3 at the onset of the note after that. The parameter level will stay at level 3 as long as staccato notes continue to be played. If a normal note is played after a series of 3 or more staccato notes, the level should be decreased to level 2 after the note has been sustained for 14 time steps, and then decreased to level 1 after 3 more time steps. This behavior is illustrated in Figure 4.

Figure 4: Illustration of the articulation experiment. Shown are 5 consecutive staccato notes followed by one normal note. This subsequence starts at time step 500 of a validation run after 90,010 training epochs. The parameter level stays at 1 at the onset of the first note, rises to 2 at the onset of the second note, and rises to 3 at the onset of the third. The parameter level remains at 3 until most of the way through the normal note, at which point it falls back down to 2 and then 1.

It is worth noting that it would be impossible for a network to learn to raise the parameter level on the onset of the first staccato note in a series of staccato notes, because at that point it is impossible to determine for how long the note will be sustained. This is a limitation of operating in real time, and must be kept in mind when creating future training examples.

Each training sequence consists of 25 subsequences, distributed as follows:

• Ten subsequences of silence, each with a random duration of 1 to 50 time steps. Output is always level 1.

• Ten subsequences each consisting of a single normal note with a random duration of 16 to 20 time steps followed by 1 to 4 time steps of silence. Output is always level 1.

• Five subsequences of 3 to 5 staccato notes followed by one normal note. Output starts at level 1, increases to level 2 at the onset of the second note, increases to level 3 at the onset of the third note, decreases to level 2 fifteen time steps into the normal note, and then decreases back to level 1 three time steps later.

Prior to each training epoch and each validation pass, these subsequences are shuffled to create a unique sequence which still contains the desired properties.

As in the previous experiment, the state of the controller is passed to the network as a single input x_0 which is 0 or 1 depending on whether or not a key is depressed. The output vector again represents discrete parameter levels, but in this case only 3 levels are used.

The input vector at time step t also contains the parameter level from time step t − 1. During training the target vector from time step t − 1 is used. During validation all elements of the output vector from time step t − 1 are set to 0 except for the maximum element, which is set to 1. Feeding the output from t − 1 into the network along with the current controller state improves training accuracy dramatically.
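The sketch below illustrates how that feedback input might be assembled, with the previous target fed back during training and the previous output collapsed to a one-hot vector during validation; the function name and its argument conventions are assumptions for illustration only.

    # Illustrative construction of the input vector with the fed-back
    # parameter level from time step t-1 (names and conventions assumed).
    import numpy as np

    LEVELS = 3

    def level_one_hot(level):
        v = np.zeros(LEVELS)
        v[level - 1] = 1.0
        return v

    def build_input(key_down, prev_target=None, prev_output=None):
        """Controller state plus the previous time step's parameter level."""
        x0 = 1.0 if key_down else 0.0
        if prev_target is not None:
            feedback = prev_target                                      # training: use the target
        else:
            feedback = level_one_hot(int(np.argmax(prev_output)) + 1)   # validation
        return np.concatenate([[x0], feedback])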
Perfect output was achieved in this experiment as well, using the same network topology and parameters as in the previous experiment.

Future Work
It has been shown that LSTM networks are capable of learning to output simple temporal changes in the presence of a stimulating input, and that LSTM networks can be trained to recognize articulation patterns and to output parameter levels based on provided examples. Future experimentation will focus on training networks to achieve tasks that depend on pitch and dynamics as well as articulation.

Once satisfactory performance has been established using generated data, the system will be tested by various musicians using their own training examples, which will further expose the strengths and limitations of the system.

Conclusions

Employing a computer system to automatically control sound synthesizer parameters during human performance is an unexplored problem that warrants continued investigation. Results from initial experimentation suggest that LSTM networks have great potential for use in solving this problem. Successful application here will hopefully aid others in applying LSTM to other problems that involve continuous real time sequences.
Adopting a machine learning approach to this problem allows for parameter control that is consistent with performers' desired aesthetics, and allows such a system to be used by musicians who do not possess computer programming skills. Machine learning applications usually learn from large aggregations of data sampled from many individuals. This project puts the teaching power directly in the hands of individuals to allow them to fully realize their visions.

                       References
Arcos, J. L.; De Mántaras, R. L.; and Serra, X. 1998. Saxex: A case-based reasoning system for generating expressive musical performances. Journal of New Music Research 27(3):194–210.

Assayag, G.; Bloch, G.; Chemillier, M.; Cont, A.; and Dubnov, S. 2006. OMax brothers: A dynamic topology of agents for improvization learning. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, AMCMM '06, 125–132. New York, NY, USA: ACM.

Eck, D., and Lapalme, J. 2008. Learning musical structure directly from sequences of music. University of Montreal, Department of Computer Science, CP 6128.

Eck, D., and Schmidhuber, J. 2002. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, 747–756. IEEE.

Elman, J. L. 1990. Finding structure in time. Cognitive Science 14:179–211.

Gers, F. A.; Schmidhuber, J. A.; and Cummins, F. A. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12(10):2451–2471.

Gers, F. A.; Schraudolph, N. N.; and Schmidhuber, J. 2003. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research 3:115–143.

Graves, A. 2013. Generating sequences with recurrent neural networks. CoRR abs/1308.0850.

Jordan, M. I. 1986. Serial order: A parallel distributed processing approach. Technical Report ICS Report 8604, Institute for Cognitive Science, University of California, San Diego.

Kirke, A., and Miranda, E. R. 2009. A survey of computer systems for expressive music performance. ACM Computing Surveys 42(1):3:1–3:41.

Narmour, E. 1990. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press.

Raphael, C. 2010. Music plus one and machine learning. In Proceedings of the 27th International Conference on Machine Learning, ICML 10.

Widmer, G.; Flossmann, S.; and Grachten, M. 2009. YQX plays Chopin. AI Magazine 30(3):35.

Widmer, G. 2001. Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence 146:129–148.