Towards a Machine Learning Based Control of Musical Synthesizers in Real-Time Live Performance

Nathan Sommer and Anca Ralescu
EECS Department, University of Cincinnati, ML0030
Cincinnati, OH 45221-0030, USA
sommernw@mail.uc.edu, anca.ralescu@uc.edu

Abstract

Musicians who play synthesizers often adjust synthesis parameters during live performance to achieve a more expressive sound. Training a computer to make automatic parameter adjustments based on examples provided by the performer frees the performer from this responsibility while maintaining an expressive sound in line with the performer's desired aesthetic. This paper is an overview of ongoing research to explore the effectiveness of using Long Short-Term Memory (LSTM) recurrent neural networks to accomplish this task.

Introduction

Electronic sound synthesizers have been used as musical instruments for more than a century, and musicians and researchers continue to explore new ways to synthesize interesting and expressive sounds. Approaches to allow humans to control synthesizers include woodwind style controllers, guitar and other stringed instrument style controllers, and controllers that map gestures to musical events. However, the most popular synthesizer controller continues to be the piano-style keyboard.

Figure 1: Basic functionality of the proposed system. Note events are received by the system and allow it to continually update the current musical context, which is continually fed through the LSTM network. Note events are passed through to the synthesizer along with generated synthesizer parameter change events.

The keyboard remains an attractive controller because it is familiar to many musicians, and because it is a natural way to tell a synthesizer to start or stop playing a sound. When a key on a keyboard is depressed, a message is sent to start playing a sound at a certain frequency. When the key is released, a message is sent to tell the synthesizer to stop.

What's missing from this is a way to control the quality of the sound once a key has been depressed. Wind instruments allow the musician to alter the quality of the sound through breath and mouth control, and bowed string instruments allow for different sounds through different bowing techniques. To allow for similar expressive sound adjustments, most synthesizers have a number of parameters that can be adjusted via knobs, sliders, wheels, pedals, and other methods. This allows for a great deal of sound control, but the number of parameters that can be controlled simultaneously is limited by the number of hands and feet the performer has, and often the performer would like to use both hands simultaneously to play the keyboard.

One way to allow for a more expressive sound during performance without requiring the human performer to directly control synthesis parameters is to use a computer to control the parameters. Many modern synthesizers are hardware or software modules that are not directly connected to a keyboard. Multiple protocols exist to control these synthesizers, such as MIDI and OSC. Keyboards, other types of controllers, and computers can send messages to these synthesizers telling them to start or stop playing a sound, or to change a synthesis parameter value. With such a setup, a computer program can monitor messages from a keyboard controller and alter the synthesis parameters in real time based on what the human is playing. This paper proposes a synthesizer control system, the Middleman, which implements such a setup and is illustrated in Figure 1.

Because different musicians have different desired aesthetics, there is no one correct way to shape synthesis parameters over time during a performance. Ideally a performer would be able to teach a machine to control the parameters in a way that is consistent with the performer's desired aesthetic. This paper explores the extent to which machine learning techniques, specifically a class of neural networks, can be used to achieve such control in a manner comparable to that of a human musician.
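Concretely, the behavior in Figure 1 can be thought of as an event loop sitting between the controller and the synthesizer. The sketch below is a minimal Python illustration of that loop; the object and method names (controller.receive_event, model.predict, and so on) are hypothetical placeholders, not an existing implementation.

```python
# Hypothetical sketch of the Middleman event loop described above.
# The controller/synthesizer I/O objects and the trained model are
# placeholders; only the control flow mirrors Figure 1.

def run_middleman(controller, synthesizer, model, context):
    """Forward note events to the synth, inserting predicted parameter changes."""
    while True:
        event = controller.receive_event()        # blocking read of a note on/off event

        context.update(event)                     # refresh the current musical context
        levels = model.predict(context.vector())  # one LSTM forward pass

        # Set parameters before the note event reaches the synthesizer,
        # so the new sound quality applies to the incoming note.
        for param, value in levels.items():
            synthesizer.send_parameter_change(param, value)

        synthesizer.send_event(event)             # pass the note event through unchanged
```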
Expressive Music Performance

Expressive music performance has been of particular interest to researchers in the last decade, and much work has been done to attempt to model expressive human performance with machines (Kirke and Miranda, 2009). These models can be used to generate expressive musical performances by machine alone, and also for collaborative performances between human and machine.

When human musicians perform a piece of music from a written score, they inject their own expression and aesthetic into the music by varying the following musical parameters, or performance actions (Kirke and Miranda, 2009):

• tempo, the speed of the music
• dynamics, how loud the notes are played
• articulation, the transitions between notes
• intonation, pitch accuracy
• timbre, the quality of the sound

A musician may only be able to vary a subset of these actions depending on the instrument played. For example, a saxophone player can vary the timbre and intonation of a note as it is being played by altering the tightness of the mouth on the mouthpiece of the instrument, but a piano player cannot achieve this at all.

Traditionally, music composers and arrangers provide information about how pieces of music are intended to be performed through scores. Scores contain notation that tells musicians how loud to play, when to speed up and slow down, what articulation to use for what notes, and so on. The musician ultimately decides how to interpret this information, and adds expressive subtlety to a piece of music during performance that cannot be conveyed in a score alone.

Expressive Computer Music Performance

Similarly to a human performer, a computer can perform a piece of music on an electronic instrument as it is written in a score. However, it is difficult for a computer to perform a piece of music with the same expressive subtlety as a human performer. Computers are good at following rules, but the rules for musical expression are difficult to define. Different styles of music have different rules of expression, and often what makes a particular performance of a piece of music interesting is how the musician plays with the listener's expectations of such rules.

One way researchers have tried to get computers to perform music in an expressive way is by learning expression rules from human performances (Widmer, 2001). Performance examples are used which include a musical score along with a recorded human performance, and algorithms are employed to explicitly or implicitly extract performance rules that a computer can use to perform unseen scores.

One of the easiest instruments with which to accomplish this task is the piano. Unlike many wind and string instruments, the piano does not allow for subtle control of dynamics, timbre, and intonation once a note has been struck. Because the piano can only play discrete notes there is no control over intonation, and the timbre can only be controlled in the same manner as the dynamics, through how hard the key is struck.

Because of this, a piano performance can easily be described with key and pedal events rather than recorded audio, and mechanical pianos can be controlled by a computer. There is an annual piano performance rendering contest called Rencon (http://renconmusic.org/) which evaluates performance rendering systems' abilities at performing unseen musical scores on the piano, and which continues to push progress in this area (Widmer, Flossmann, and Grachten, 2009). An interesting aspect of this competition is that while the contestants are computer systems, the judges are human, and therefore the evaluations are highly subjective.

Human Computer Collaborative Performance

Other research explores ways in which humans and computers can collaboratively contribute to expressive performances. The OMax system (Assayag et al., 2006) allows for improvisational collaboration in real time by listening to the human musician, learning features of the musician's style, and playing along interactively. Music Plus One (Raphael, 2010) is an automatic accompaniment system which plays a scored piece along with a soloist, following the soloist's tempo. In addition to the piano performance rendering contest, Rencon also has a "semi-automatic and interactive" category, which in 2013 was won by VirtualPhilharmony, a system that allows a human to conduct a virtual orchestra.
All of these examples are exciting works that showcase the extent to which humans and computers can work together to make music, but none of the collaborative systems developed so far address the problem put forth in this paper – allowing a human performer to control the pitch, tempo, dynamics, and articulation of a performance, while a computer controls timbre and intonation by varying sound synthesis parameters.

We hypothesize that a machine learning approach to this problem can be successful. Allowing musicians to train the system with example performances created by the musicians themselves will result in performances that are unique and adhere to the performers' visions. This problem also presents unique challenges from a machine learning perspective, which are discussed in later sections.

Creating Performance Examples

Often when creating recordings in a studio, synthesizer parts are initially recorded as events rather than as audio. This allows musicians to separate the recording of note events from the recording of synthesizer parameter change events. After a keyboard performance has been recorded, parameter changes can be recorded to achieve the desired sound over the course of the recording. In this way, musicians can create interesting and expressive synthesizer recordings that could not be performed live by a single musician.

These studio event recordings can be used as training examples for our system. If the system can learn to reproduce the desired parameter changes while a human is performing, temporally changing sounds that were previously only attainable in a studio can be brought to life during live performances. This method of creating training examples is natural because musicians are already accustomed to recording this way, and it allows them to use the synthesizers they are already familiar with.
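As an illustration of what such a training example might look like once flattened for a learner, the sketch below pairs a per-time-step controller state with the recorded parameter level. The event tuple formats are assumptions made for this example, not a defined recording format.

```python
# Minimal sketch: turning a recorded (note events, parameter events) pair into
# aligned per-time-step training sequences. Event tuples are hypothetical:
# note events are (time, "on"/"off", pitch, velocity) and parameter events
# are (time, level), with times given in time steps.

def build_training_example(note_events, param_events, num_steps):
    inputs, targets = [], []
    held = {}             # currently depressed keys: pitch -> velocity
    level = 1             # last recorded parameter level
    note_events = sorted(note_events)
    param_events = sorted(param_events)

    for t in range(num_steps):
        # apply all events that occur at this time step
        while note_events and note_events[0][0] == t:
            _, kind, pitch, velocity = note_events.pop(0)
            if kind == "on":
                held[pitch] = velocity
            else:
                held.pop(pitch, None)
        while param_events and param_events[0][0] == t:
            _, level = param_events.pop(0)

        inputs.append(dict(held))   # controller state at time t
        targets.append(level)       # desired parameter level at time t
    return inputs, targets
```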
Learning with Time Series Data

Many tasks which have been tackled with machine learning approaches involve time series data. Some of these tasks, such as speech recognition, involve finding patterns in time series data. Other tasks, such as numerous types of forecasting, involve predicting what will happen in the future based on what has happened in the past.

For this particular problem we are concerned with predicting the immediate future values of synthesizer parameters. These values must be predicted based on two things:

• past parameter values
• the current musical context of the piece being played.

The important aspects of the musical context that affect the parameter levels are defined implicitly by the musician through training examples, and so the learning system employed must be able to discover those aspects. The current musical context at any time step during a musical performance depends on events that have happened at previous time steps, and so the system must have an awareness of the past. Which past events are important and how long they remain important will differ for each set of training examples, and so the system must be flexible in that regard. The following sections discuss some techniques that have been used to achieve this goal in other problem domains.

Recurrent Neural Networks

Artificial neural networks have long been useful tools in machine learning due to their ability to approximate non-linear functions. While the standard artificial neural network can be useful for some time series learning tasks, there are limitations to the model when applied to time series data.

After a standard neural network is trained, data is input via input nodes. In each subsequent layer, node activations are calculated by applying activation functions to weighted sums of the previous layer's activation values. No information about previous activation values is stored, and so the network is in the same state before each forward pass. Therefore, in order for such a network to accept time series data as input it must receive a window of data: all the data after time step t − n up to time step t, where t is the latest time step under consideration and n is the size of the window.

This sliding window approach has limitations because events relevant to time step t could have occurred at or before time step t − n, yet they will not be taken into consideration by the network because they are outside of the window. An alternative to the window approach is the Recurrent Neural Network (RNN) (Jordan, 1986; Elman, 1990). RNNs contain nodes which retain activation values from the previous time step, and contain recurrent connections from those nodes to other nodes in the network. In this manner, data can be fed into the network one time step per forward pass, and the network will learn to take into account information from past time steps when calculating output for the current time step.

Simple RNNs as described above are generally trained using Backpropagation Through Time (BPTT). Using this method, errors at the current time step flow backwards through previous time steps in order to calculate changes in the network's weights. However, these errors either vanish or blow up as they travel backwards in time. As a result, simple RNNs cannot learn well when relevant events happen more than 5-10 time steps in the past (Gers, Schmidhuber, and Cummins, 2000).
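Before moving on, the contrast between the two input schemes discussed in this subsection can be made concrete with a short sketch. Assuming each time step has already been encoded as a feature vector, the windowed feedforward scheme can only ever see the last n steps, while a recurrent network is fed one step per forward pass and carries its own state (the rnn.step method here is a hypothetical stateful forward pass).

```python
import numpy as np

# Contrast of the two input schemes discussed above, with placeholder models.

def windowed_inputs(series, n):
    """Yield flattened windows of the last n time steps (feedforward scheme)."""
    for t in range(n, len(series) + 1):
        yield np.concatenate(series[t - n:t])    # events before t - n are invisible

def run_recurrent(series, rnn):
    """Feed one time step per forward pass; the network keeps its own state."""
    outputs = []
    for x_t in series:
        outputs.append(rnn.step(x_t))            # hypothetical stateful forward pass
    return outputs
```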
LSTM Networks

Long Short-Term Memory (LSTM) RNNs overcome this limitation (Gers, Schmidhuber, and Cummins, 2000). LSTM networks contain memory cells which retain values between forward passes through the network. The networks also contain three types of specialized activation units called gates. A group of memory cells and their associated gates are organized into blocks. Input gates control write access to the memory cells by scaling values that are to be added to the memory cells; output gates control read access from the memory cells by scaling the values that are output by the cells; forget gates allow the memory cell to periodically reset by scaling the cell's current value. The gates have weighted connections from the input as well as recurrent connections from the other gates and memory cells. During training the gates learn to open and close so that the memory cells can store and accumulate values, keep them for arbitrary periods of time, and use the cells' values to affect the output as needed.

Figure 2: LSTM block with a single memory cell, taken from Gers, Schraudolph, and Schmidhuber (2003). The gates and cell input activations are calculated by passing the weighted sum of incoming connections through an activation function, as with a standard artificial neural network node. Input to the cell is scaled by the input gate's activation, output from the cell is scaled by the output gate's activation, and the cell's state is scaled by the forget gate's activation.

LSTM networks have been used successfully both in recognizing patterns from sequences and in generating sequences. Most notably they have proven highly effective in handwriting recognition and generation (Graves, 2013). There have been musical applications of LSTM as well: LSTM networks have been used to generate musical chord progressions (Eck and Schmidhuber, 2002) and for learning musical structure from scores (Eck and Lapalme, 2008), but these applications operate on the level of notes and chords and do not operate in real time. Our project explores how effectively LSTM networks can be trained to implicitly extract relevant high level musical structures from low level time series input, and to use that information to control sound synthesis parameters.
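For reference, a single time step through a simplified one-cell block like the one described above and in Figure 2 can be written as follows. This NumPy sketch follows the standard gating equations and omits refinements such as the peephole connections of Gers, Schraudolph, and Schmidhuber (2003); it is an illustration, not the custom implementation used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_step(x, h_prev, c_prev, W, R, b):
    """One time step of a simplified LSTM block (no peephole connections).

    x: input vector; h_prev: previous block output; c_prev: previous cell state.
    W, R, b: dicts of input weights, recurrent weights, and biases for the
    cell input ("z") and the input/forget/output gates ("i", "f", "o").
    """
    z = np.tanh(W["z"] @ x + R["z"] @ h_prev + b["z"])   # candidate cell input
    i = sigmoid(W["i"] @ x + R["i"] @ h_prev + b["i"])   # input gate scales the write
    f = sigmoid(W["f"] @ x + R["f"] @ h_prev + b["f"])   # forget gate scales the old state
    o = sigmoid(W["o"] @ x + R["o"] @ h_prev + b["o"])   # output gate scales the read

    c = f * c_prev + i * z        # memory cell accumulates gated input
    h = o * np.tanh(c)            # block output is the gated, squashed cell state
    return h, c
```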
Specific Challenges

In order for this system to be successful, several challenges must be overcome. The learning system is to be given the musical context at each time step, and so the most effective manner of encoding the current musical context must be determined. The system must run in real time, so care must be taken to ensure it can continually predict parameter levels quickly enough. Generalization is an issue with any machine learning task, and here one must be careful not to overfit to the provided training examples so that the system can generalize well. Finally, suitable metrics must be devised to evaluate the success of the system.

Capturing the Musical Context

When this system is used, it will be given a regularly updated state of the current musical context. This information must be determined from the stream of note on and note off events received from the keyboard, and it should provide enough context to the learning system so that it can effectively predict the synthesizer parameter levels.

One very simple way to capture the context is to have a single input value which represents whether or not a note is currently being played. When a note on event is sent from a controller, it has two values: pitch and velocity. The pitch value indicates which key was depressed on the keyboard, and the velocity value indicates how hard the key was struck. Given the normalized velocity v of the last note played, where 0 < v ≤ 1, a very simple single input x_0 looks like this:

x_0 = { 0 if no note is depressed; v otherwise }

This input scheme can capture the timing and dynamics of the piece being played, but the pitch is ignored and it cannot take into account polyphonic playing, where the performer depresses more than one key at a time. If a musician merely wants parameter changes based on the timing and dynamics of monophonic playing, however, this input scheme might be a good choice.
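A sketch of this single-value encoding, assuming note on events carry a MIDI-style velocity in the range 1–127, is given below; the class and method names are illustrative only.

```python
# Sketch of the single-input encoding x_0 described above.
# Velocity is assumed to arrive in the MIDI range 1-127 and is normalized
# so that 0 < v <= 1; a value of 0 means no note is currently depressed.

class MonophonicContext:
    def __init__(self):
        self.velocity = 0.0          # normalized velocity of the held note, 0 if none

    def note_on(self, pitch, velocity):
        self.velocity = velocity / 127.0

    def note_off(self, pitch):
        self.velocity = 0.0

    def x0(self):
        """Return 0 if no note is depressed, the normalized velocity otherwise."""
        return self.velocity
```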
Another option is to have one input for each key on the keyboard. For each key i on the keyboard, with v_i the normalized velocity with which key i was struck, the input vector element x_i looks like this:

x_i = { 0 if key i is not depressed; v_i otherwise }

This input scheme captures pitch, timing, and dynamics, and hypothetically provides enough information for a properly trained network to be able to extract any musical context that might be relevant. However, it remains to be seen whether such training is feasible. Having one input value for each key on the keyboard might require significantly larger networks, which introduce longer training times and additional processing latency.

A third option is to explicitly provide the network with higher level musical features. Melodic information can be determined based on the intervals between consecutive notes. Expressive computer music performance systems (Widmer, Flossmann, and Grachten, 2009; Arcos, De Mántaras, and Serra, 1998) have had success using the Implication-Realization model (Narmour, 1990), which can be used to classify small local melodic structures. Such information can be determined from a monophonic performance, but becomes difficult to extract if the performance is highly polyphonic.

Harmonic information can be explicitly determined if multiple notes are being played at once. The same harmonic intervals can mean different things depending on the key of the piece. Therefore, to use this information as input, the system must either be trained for a specific key or the training data must be transposed to different keys during training to achieve generalization across keys.

Real Time Performance

Musicians using this system will be playing their keyboards, or whatever controllers they prefer. Every time a note is played, the system must predict the values of the synthesizer parameters and set them before the note event is sent to the synthesizer. This ensures that the parameters are set before the note is played, so that the synthesizer creates the desired sound. Thus the system must be able to run fast enough that the total latency, from the time at which the key is depressed to the time at which the note is played on the synthesizer, is within the musician's acceptable latency tolerance.

In preliminary testing, relatively small LSTM networks have been able to complete a forward pass in approximately 50 microseconds on a mid-range laptop. For most musicians, added latency does not become an issue until it reaches several milliseconds. As this research progresses the networks are sure to grow in size, increasing the time required to complete a forward pass. Because forward pass time is the primary contributor to added latency, the networks cannot grow past a certain size before use will no longer be satisfactory to the musician. This limit will be continually evaluated as development of the system progresses.

Generalization

Any good machine learning system must be able to generalize well. Generalization becomes an issue here because it is impossible for human musicians to play something on an instrument exactly the same way twice, and often musicians will play the same musical phrase with slightly different tempo, dynamics, and articulation each time for variety. This system must be able to make satisfying parameter adjustments when musicians play phrases that are similar to the example phrases used to train the system, but not exactly the same.

Normally it is ideal to train a learning system with a large data set. Providing as many training examples as possible increases the generalizing power of the learned model. There are several possible ways to increase the number of training examples for this particular system. One is for the musician to play numerous examples of similar musical phrases and create parameter curves for each. However, this is a rather cumbersome task, and the musician still might not be able to produce enough examples for satisfactory generalization.

Another way is for the system to alter the data during training in the same ways that a human might alter the playing of a musical phrase during performance. This involves changing the duration and velocity of notes, the tempo at which the notes are played, and even the notes themselves.

Altering the examples during training is similar to generating expressive computer music performances based on example human performances. The difference here is that these altered examples will never be heard, and thus do not need to sound like authentic human performances. There is generally a consistency to the way human performers make changes to the timing and dynamics of a musical phrase to achieve an expressive performance. For example, if a performer is playing an upward moving phrase, he or she might increase the tempo and dynamics as the phrase moves upwards. It could sound musically awkward to speed up and slow down multiple times during the same upward moving phrase, and that is something an expressive performance generating system would want to avoid. However, if one is only concerned with creating an altered training example for the sake of generalization, it is not a concern if the example does not sound musically correct as a whole. It is only important that the individual changes within the phrase are consistent with what might happen note to note during a performance.
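One straightforward way to perform such alterations is to rescale the overall tempo of a phrase and perturb each recorded note independently, as in the sketch below. The note dictionary format and the perturbation ranges are assumptions chosen for illustration, not values used in this work.

```python
import random

# Illustrative augmentation of a recorded phrase: jitter timing, duration, and
# velocity note by note, as a human might vary a phrase between performances.
# Perturbation ranges are arbitrary; notes are hypothetical dicts with
# "start", "duration", "pitch", and "velocity" fields (times in time steps).

def perturb_phrase(notes, tempo_range=0.1, vel_range=0.1, dur_range=0.1):
    tempo_scale = random.uniform(1 - tempo_range, 1 + tempo_range)
    altered = []
    for note in notes:
        altered.append({
            "start": int(round(note["start"] * tempo_scale)),
            "duration": max(1, int(round(note["duration"]
                                         * random.uniform(1 - dur_range, 1 + dur_range)))),
            "pitch": note["pitch"],
            "velocity": min(1.0, max(0.01, note["velocity"]
                                     * random.uniform(1 - vel_range, 1 + vel_range))),
        })
    return altered
```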
Evaluation and Measuring Success

It is important to establish metrics to determine the level of success of this approach. There are both subjective and objective measures of quality to consider when evaluating performance.

To subjectively evaluate such a system, it needs to be put in the hands of a variety of musicians. Different musicians will have different ideas of how to use it, and different ways of determining whether it lives up to their expectations. After collecting initial feedback and seeing how musicians use the system, it will be easier to determine more specific subjective evaluation criteria.

As mentioned before, the system must be able to generate satisfactory output when presented with musical phrases that are similar to the phrases used to train the system. In some cases, what the system sees as similar and what the musician sees as similar might not agree, and what might be seen as a failure of the system might be due to training examples that do not properly capture the desired behavior.

Objective metrics are easier to define. As with any supervised learning task, the general goal is to minimize training error, and to do so in as few training epochs as possible. Different methods for capturing the musical context and different network topologies and parameters can be objectively compared based on the level of error minimization achieved during training and, in the case of similar results, the amount of time taken to train. These objective metrics can be compared with subjective evaluations of performance to ensure that optimizing the objective metrics correlates with improved subjective evaluation.

Preliminary Results

Much of the work so far has been on developing a custom LSTM implementation and conducting training experiments to ensure that LSTM is a suitable learning algorithm for this problem. Two simple experiments are presented here. The first experiment demonstrates that the LSTM implementation presented here can learn to output a basic triangle-shaped temporal parameter change on demand. The second experiment shows that LSTM is capable of learning to detect higher level articulation patterns from an input stream and output different values based on the temporal position within a pattern.

Simple Temporal Modulation Experiment

This experiment was devised to determine how well an LSTM network can learn to output a simple triangle-shaped parameter modulation whenever a key is pressed. This shape is shown in Figure 3.

Figure 3: A simple triangle-shaped parameter modulation over 15 time steps. The network was able to learn to output this shape perfectly after 27,370 training epochs. It is merely an approximation of a triangle due to the 16 discrete output levels.

Rather than outputting a single continuous parameter value, the network outputs a vector y which contains 16 values representing discrete parameter levels. The predicted parameter level is selected by finding the maximum element in y. Each element of each target vector is set to 0 except for the element representing the desired parameter level, which is set to 1. Training follows the backpropagation algorithm described in Gers, Schmidhuber, and Cummins (2000).

Because the output is discrete rather than continuous, the output pattern is merely an approximation of a triangle shape.

A single network input x_0 represents the current state of the controller at the current time step, and has two possible values:

x_0 = { 0 if no note is depressed; 1 otherwise }

Each training sequence consists of 25 subsequences, distributed as follows:

• One subsequence containing the triangle-shaped modulation. Input for each time step is 1. The output starts at level 1, rises to level 16, and falls back to 1 over 15 time steps. This results in output that is a discrete approximation of the triangle shape.

• Ten subsequences representing silence. Input for all time steps is 0 and output for all time steps is level 1. The lengths of the silent subsequences range from 9 to 11 time steps.

Prior to each training epoch and each validation pass, these subsequences are shuffled so that the non-silent section is in a different position within the sequence for each pass.

The topology of the LSTM hidden layer consists of seven memory cells, each with input, output, and forget gates. All units in the hidden layer receive a weighted connection from the input, as well as recurrent connections from all other units in the hidden layer. The memory cells are fully connected to the output units. Gates and output units utilize a sigmoid activation function.

Networks with this topology are able to learn to output the desired shape perfectly over time steps with an input equal to 1.
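A training sequence with these properties could be generated along the lines of the sketch below. It mirrors the description above (a triangle subsequence plus silent subsequences, one-hot targets over 16 levels, reshuffled before each pass), but the exact interpolation of the triangle and the helper names are illustrative assumptions rather than the code used in the experiment.

```python
import random

def one_hot(level, num_levels=16):
    """Target vector: all zeros except the desired parameter level."""
    v = [0.0] * num_levels
    v[level - 1] = 1.0
    return v

def triangle_subsequence(length=15, num_levels=16):
    """Input is 1 at every step; target rises from level 1 to 16 and back to 1."""
    seq = []
    peak = (length - 1) / 2.0
    for t in range(length):
        frac = 1.0 - abs(t - peak) / peak           # 0 at the ends, 1 at the peak
        level = 1 + round(frac * (num_levels - 1))  # discrete approximation of the triangle
        seq.append((1.0, one_hot(level, num_levels)))
    return seq

def silent_subsequence():
    """Input 0 and target level 1 for 9 to 11 time steps."""
    return [(0.0, one_hot(1))] * random.randint(9, 11)

def training_sequence():
    subsequences = [triangle_subsequence()] + [silent_subsequence() for _ in range(10)]
    random.shuffle(subsequences)                    # reshuffled before every pass
    return [step for sub in subsequences for step in sub]
```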
Articulation Experiment

For this experiment, training data is generated to simulate notes being played with different articulation. Notes with two types of articulation are generated: staccato notes, which here last for a duration of 8 to 12 time steps followed by 8 to 12 time steps of silence, and normal notes, which last for a duration of 16 to 20 time steps followed by 1 to 4 time steps of silence.

At each time step the network is to output one of three parameter levels, as follows. Level 1 is the normal level, which is set while normal notes are being played and during extended periods of silence. If a staccato note is played, the parameter level should be increased to level 2 at the onset of the next note. If the next note is also staccato, then the level should be increased to level 3 at the onset of the note after that. The parameter level stays at level 3 as long as staccato notes continue to be played. If a normal note is played after a series of 3 or more staccato notes, the level should be decreased to level 2 after the note has been sustained for 14 time steps, and then decreased to level 1 after 3 more time steps. This behavior is illustrated in Figure 4.

Figure 4: Illustration of the articulation experiment. Shown are 5 consecutive staccato notes followed by one normal note. This subsequence starts at time step 500 of a validation run after 90,010 training epochs. The parameter level stays at 1 at the onset of the first note, rises to 2 at the onset of the second note, and rises to 3 at the onset of the third. The parameter level remains at 3 until most of the way through the normal note, at which point it falls back down to 2 and then 1.

It is worth noting that it would be impossible for a network to learn to raise the parameter level at the onset of the first staccato note in a series of staccato notes, because at that point it is impossible to determine for how long the note will be sustained. This is a limitation of operating in real time, and it must be kept in mind when creating future training examples.

Each training sequence consists of 25 subsequences, distributed as follows:

• Ten subsequences of silence, each with a random duration of 1 to 50 time steps. Output is always level 1.

• Ten subsequences each consisting of a single normal note with a random duration of 16 to 20 time steps followed by 1 to 4 time steps of silence. Output is always level 1.

• Five subsequences of 3 to 5 staccato notes followed by 1 normal note. Output starts at level 1, increases to level 2 at the onset of the second note, increases to level 3 at the onset of the third note, decreases to level 2 fifteen time steps into the normal note, and then decreases back to level 1 three time steps later.

Prior to each training epoch and each validation pass, these subsequences are shuffled to create a unique sequence which still contains the desired properties.

As in the previous experiment, the state of the controller is passed to the network as a single input x_0 which is 0 or 1 depending on whether or not a key is depressed. The output vector again represents discrete parameter levels, but in this case only 3 levels are used.

The input vector at time step t also contains the parameter level from time step t − 1. During training the target vector from time step t − 1 is used. During validation all elements of the output vector from time step t − 1 are set to 0 except for the maximum element, which is set to 1. Feeding the output from t − 1 into the network along with the current controller state improves training accuracy dramatically.

Perfect output was achieved in this experiment as well, using the same network topology and parameters as in the previous experiment.
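The input construction with the fed-back parameter level can be sketched as follows, assuming a single 0/1 controller input and three one-hot parameter levels; during training the previous target vector is used (teacher forcing), and during validation the previous output is collapsed to its maximum element. The function name and argument layout are illustrative.

```python
import numpy as np

# Sketch of the input construction used in the articulation experiment:
# the controller state (0 or 1) is concatenated with the parameter level
# from the previous time step, one-hot encoded over 3 levels.

def build_input(key_down, prev_output, training, prev_target=None):
    if training:
        feedback = prev_target                  # teacher forcing: previous target vector
    else:
        feedback = np.zeros_like(prev_output)   # validation: keep only the maximum element
        feedback[np.argmax(prev_output)] = 1.0
    return np.concatenate(([1.0 if key_down else 0.0], feedback))
```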
Future Work

It has been shown that LSTM networks are capable of learning to output simple temporal changes in the presence of a stimulating input, and that LSTM networks can be trained to recognize articulation patterns and to output parameter levels based on provided examples. Future experimentation will focus on training networks to achieve tasks that depend on pitch and dynamics as well as articulation.

Once satisfactory performance has been established using generated data, the system will be tested by various musicians using their own training examples, which will further expose the strengths and limitations of the system.

Conclusions

Employing a computer system to automatically control sound synthesizer parameters during human performance is an unexplored problem that warrants continued investigation. Results from initial experimentation suggest that LSTM networks have great potential for use in solving this problem. Successful application here will hopefully aid others in applying LSTM to other problems that involve continuous real time sequences.

Adopting a machine learning approach to this problem allows for parameter control that is consistent with performers' desired aesthetics, and allows such a system to be used by musicians who do not possess computer programming skills. Machine learning applications usually learn from large aggregations of data sampled from many individuals. This project puts the teaching power directly in the hands of individuals to allow them to fully realize their visions.
References

Arcos, J. L.; De Mántaras, R. L.; and Serra, X. 1998. Saxex: A case-based reasoning system for generating expressive musical performances. Journal of New Music Research 27(3):194–210.

Assayag, G.; Bloch, G.; Chemillier, M.; Cont, A.; and Dubnov, S. 2006. OMax brothers: A dynamic topology of agents for improvization learning. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, AMCMM '06, 125–132. New York, NY, USA: ACM.

Eck, D., and Lapalme, J. 2008. Learning musical structure directly from sequences of music. Technical report, University of Montreal, Department of Computer Science, CP 6128.

Eck, D., and Schmidhuber, J. 2002. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, 747–756. IEEE.

Elman, J. L. 1990. Finding structure in time. Cognitive Science 14:179–211.

Gers, F. A.; Schmidhuber, J.; and Cummins, F. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12(10):2451–2471.

Gers, F. A.; Schraudolph, N. N.; and Schmidhuber, J. 2003. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research 3:115–143.

Graves, A. 2013. Generating sequences with recurrent neural networks. CoRR abs/1308.0850.

Jordan, M. I. 1986. Serial order: A parallel distributed processing approach. Technical Report ICS Report 8604, Institute for Cognitive Science, University of California, San Diego.

Kirke, A., and Miranda, E. R. 2009. A survey of computer systems for expressive music performance. ACM Computing Surveys 42(1):3:1–3:41.

Narmour, E. 1990. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press.

Raphael, C. 2010. Music Plus One and machine learning. In Proceedings of the 27th International Conference on Machine Learning, ICML '10.

Widmer, G.; Flossmann, S.; and Grachten, M. 2009. YQX plays Chopin. AI Magazine 30(3):35.

Widmer, G. 2001. Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence 146:129–148.