Predicting Mozart's Next Note via Echo State Networks

Ąžuolas Krušna, Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
azukru@ktu.edu, mantas.lukosevicius@ktu.lt

Abstract—Even though algorithmic music has been around the world since the old days, it has never attracted as many researchers as in recent years. To our knowledge it existed in Iran back in the Middle Ages and in Europe during the Age of Enlightenment. Though the form has changed and it has grown layers of complexity, the very foundations of the algorithms that generate musical compositions have not changed, i.e. most of them are based on structures of fortuity. Additionally, models that are able to learn have been discovered, allowing us to imitate the music of the incredible artists throughout history. The thought alone is astonishing and seems to come straight from science fiction. In this paper, a search for the best echo state network model to mimic the music of the legendary Wolfgang Amadeus Mozart has been carried out. As it turns out, the best models are the ones that rely on long-term dependencies.

Keywords—algorithmic composition, echo state network, MIDI, recurrent neural network

I. INTRODUCTION

Algorithmic music is by no means a new trend in our techy world. In fact, three Iranian brothers collectively known as Banu Musa were successfully devising automatic and even programmable musical instruments back in 850 AD [1]. They were most likely invited to the best parties in the city back then. Moreover, an algorithmic game circulated around Europe since the Age of Enlightenment, i.e. the 18th century, in the form of Musikalisches Würfelspiel. It has been attributed to Mozart as a kind of myth, yet never proven to be true. This game took small fragments of music and combined them in a random order, often by tossing a dice [2].

Since then, the scope of algorithmic music has gained layers of complexity, but the foundations have not changed. The main difference is that now we do not toss a dice, but run a random number generation function in our favourite programming language.
In this article we are trying to imitate classical piano music. For a quantitative rather than qualitative analysis only one composer was chosen. Mozart was opted for because of his indisputable genius and a certain haphazardness.

Rather than working with sound signals, we chose to work with notes, for several reasons. Firstly, it is a lot less intricate. Therefore, it is a lot easier for us to understand and analyse, as well as for the algorithm in terms of computational resources and dependency on previous notes. At the note level we are also able to compare the results with music theory. And it helps us stay in the realm of classical music as well.

Furthermore, we are lucky enough to have the MIDI (musical instrument digital interface) protocol: a .mid file is a musical file format that captures the notes, the times piano keys were pressed and released, how strongly they were pressed, etc. MIDI supports 128 notes, whereas general pianos usually provide 88 keys.

Musical composition has been one of the long-term goals of artificial intelligence (AI) [3]. Broadly speaking, music generation by AI is based on the principle that musical styles are in effect complex systems of probabilistic relationships, as defined by the musicologist Leonard B. Meyer. In the early days, symbolic AI methods and specific grammars describing a set of rules had driven the composition [4], [5]. Then these methods were significantly improved by evolutionary algorithms in a variety of ways [6], as represented by the famous EMI project [7]. More recently, statistics in the form of Markov chains and hidden Markov models (HMM) played a major part in algorithmic composition [8]. Next to this development was the rapid rise of neural networks (NN) due to the growing computational capacity. It has made remarkable progress not only in the AI world but also in music composition [9].

As music is a sequence of notes, a sequential model was chosen to train on Mozart's music. Markov models are not very suitable for this task due to their monophony (although it is possible to design a system for polyphonic music as well). Currently, the cutting-edge approach to generative music modelling is based on recurrent networks [4], [10], [11] like the long short-term memory (LSTM) network. Traditional recurrent neural networks (RNN) lack long-term dependency and are thus able to generate melody yet no harmony, i.e. the music gets stuck at some point or turns out to be repetitive. LSTMs are better in this case since they have a stronger long-term dependency. Though fine-tuned LSTM algorithms are able to overcome the obstacles that traditional RNN algorithms confront, they still face the same problem in the sense that the music lacks a theme, i.e. the big picture. Long short-term memory algorithms have been extensively studied in recent years. Besides, LSTM algorithms are also heavy and require a lot of resources. We have been looking for a light-weight solution.

For these reasons, we chose to work with a type of recurrent neural network, the echo state network (ESN), which has barely been researched for musical composition.

II. DATA & TOOLS

Musical data were downloaded in the .mid format from the website http://www.piano-midi.de/. From now on, MIDI and .mid will be used interchangeably to mean the file format, unless stated otherwise, e.g. the MIDI protocol. In total, 21 pieces by Mozart were gathered (all that are found on the website).

The MIDI format is a sequence of notes (and commands such as tempo changes and sound perturbations) where the time difference is represented in ticks. A quarter note is usually 480 or 960 ticks, depending on the resolution. Thus a full note or, in other words, a tact is 1920 or 3840 ticks respectively.

Later on, the data had to be transformed into a format that is easier to read, maintain and process. Hence, it was read and transformed into notes as messages in a .csv (comma separated values) format. Every message consists of information of this type:

• note pitch
• on tick
• off tick
• length

The length parameter is not in the MIDI file and had been artificially generated for the purpose of data analysis. Table I shows the types of information as well as their ranges in a message. Note pitch ranges from 0 to 127, thus a byte is more than enough to store it. The beginning and end ticks of a note are unbounded and can grow to infinity as the data grows. The length parameter is purely the difference between the on and off ticks. It may grow to a large number due to software bugs or a divergence of the algorithm, but usually it stays in the realm of classical music and takes a value of up to a full note.

TABLE I. INFORMATION INSIDE A MESSAGE

Info:  Note pitch | On tick    | Off tick   | Length
Type:  byte       | long       | long       | integer
Range: 0-127      | 0-infinity | 1-infinity | 1-full note

A message in MIDI that signifies the event of pressing a note is the note_on message. It also represents the event of a note being released, only then the velocity is equal to zero. The algorithm (Fig. 1) that was applied for the treatment of raw MIDI files looks as follows:

iterate through the messages:
    check if it is a 'note_on' type of message:
        if velocity > 0:
            take the time of the note that was pressed
        else if velocity equals 0:
            check if the note was actually pressed:
                release the note
                measure the length of the note
                append the note as a message to the CSV

Fig. 1. Algorithm of raw MIDI file treatment into a CSV file.

The programming language of choice was Python due to its recognition in data science and machine learning among scientists and developers, and also due to its many data processing and machine learning libraries, although none of the machine learning libraries were used for this work. For the purpose of .mid processing, the Mido library was chosen [12].
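As an illustration of the algorithm in Fig. 1, a minimal Python sketch using the Mido library [12] could look as follows. This is our hedged reconstruction rather than the exact script used in this work, and the file names in the usage comment are hypothetical. It walks through the merged MIDI messages, remembers when each key was pressed, pairs every press with its release (a note_on with zero velocity or a note_off), and appends the pitch, on tick, off tick and length to a CSV file.

import csv
import mido

def midi_to_csv(midi_path, csv_path):
    """Convert a .mid file into CSV note messages: pitch, on tick, off tick, length."""
    mid = mido.MidiFile(midi_path)
    pressed = {}   # pitch -> absolute tick at which the key was pressed
    rows = []
    tick = 0       # absolute time in ticks
    for msg in mido.merge_tracks(mid.tracks):
        tick += msg.time                              # msg.time is the delta time in ticks
        if msg.type == 'note_on' and msg.velocity > 0:
            pressed[msg.note] = tick                  # key pressed
        elif msg.type in ('note_on', 'note_off') and msg.note in pressed:
            on_tick = pressed.pop(msg.note)           # key released
            rows.append((msg.note, on_tick, tick, tick - on_tick))
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(('pitch', 'on_tick', 'off_tick', 'length'))
        writer.writerows(rows)

# midi_to_csv('mozart_sonata.mid', 'mozart_sonata.csv')   # hypothetical file names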
Machine learning algorithms perform better with more data. We could have taken all of the composers from the website full of classical MIDI files, but we chose only one for the purpose of a thorough analysis. Despite the fact that our choice was only Mozart's music and that gave us only 21 pieces of scores, this resulted in around 68 thousand notes.

III. INITIAL DATA ANALYSIS

Prior to the research, an analysis of the data was performed based on the distribution of note pitches as well as their lengths. As we can clearly see in Fig. 2, there are two maxima. One is of a higher pitch while the other is of a quite lower pitch. This is most probably due to the fact that the piano is played by two hands and that the left hand usually wanders in the region of lower-pitched notes whilst the right hand sits in the region of higher-pitched notes.

Fig. 2. Distribution of Mozart notes. One can clearly spot two maxima, of a higher and a lower pitch. This is most likely because the piano is played by two hands: the left hand usually wanders in the region of lower pitches whilst the right hand sits in the region of higher pitches.

These data are not directly relevant for our research, but provide us with insights, such as that it would make perfect sense to study the hands in more detail. We ought to bolster our research either by adding an additional dimension for the hand or by having two different outputs of the network, one for each hand. This analysis is also useful for future comparison and judgment of the generated music.

The analysis of lengths (Fig. 3) provides only one maximum, meaning either that both hands share the same maximum or that the note lengths of one hand are very dispersed.

Fig. 3. Distribution of Mozart note lengths.

IV. NETWORK

Echo state networks supply an architecture and principles of supervised learning for recurrent neural networks. The idea behind an ESN is to drive a large, random and fixed reservoir of neurons with the input signal (Fig. 4), thereby inducing in each neuron within it a nonlinear response signal, and then to combine the desired output by a trainable linear combination of all of these response signals [13]. In practice, it is important to keep in mind that the reservoir acts not only as a nonlinear expansion, but also as a memory of the input at the same time [14].

An echo state network may be tuned by altering the following parameters:

• leaking rate
• input scaling
• spectral radius

The leaking rate of the network can be regarded as the speed of the reservoir update dynamics in discrete time. Another key parameter to optimize an ESN is the input scaling. It multiplies the input weight matrix Win by its value, either strengthening or diminishing the input weights.

Fig. 4. Design of an echo state network [14]. Here u is the input data, Win is the input weight matrix, x holds the reservoir nodes and their outputs, W is the matrix of their weights, Wout is the output weight matrix and y is the output data.

The spectral radius is one of the most global parameters of an ESN, i.e. the maximum absolute eigenvalue of the reservoir weight matrix W. It scales the matrix W or, in other words, scales the width of the distribution of its nonzero elements [14].

In order to avoid overfitting, regularization is used. The number of neurons inside the reservoir was opted to be equal to 1000. The programming code for the ESN has been adapted from [15] and expanded for multidimensional input as well as output data.
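To make the description above concrete, here is a hedged, minimal NumPy sketch of a leaky-integrator ESN with a ridge-regression readout, in the spirit of [14], [15]; it is not the authors' exact code. The reservoir update is x(n) = (1 - a) * x(n-1) + a * tanh(Win [1; u(n)] + W x(n-1)), where a is the leaking rate, the input scaling multiplies Win, the reservoir matrix W is rescaled to the chosen spectral radius, and the readout Wout is found by ridge regression with the regularization parameter. The default parameter values below simply echo the best values reported later in Section VI.

import numpy as np

class ESN:
    """Minimal leaky-integrator echo state network with a ridge-regression readout."""

    def __init__(self, n_in, n_out, n_res=1000, leak=0.025,
                 in_scale=0.0005, rho=0.1, reg=1e-3, seed=42):
        rng = np.random.default_rng(seed)
        self.leak, self.reg = leak, reg
        # Input weights (with a bias column), scaled by the input scaling.
        self.Win = in_scale * (rng.random((n_res, 1 + n_in)) - 0.5)
        # Random reservoir weights rescaled to the desired spectral radius.
        W = rng.random((n_res, n_res)) - 0.5
        self.W = W * (rho / np.max(np.abs(np.linalg.eigvals(W))))
        self.Wout = np.zeros((n_out, 1 + n_in + n_res))
        self.x = np.zeros(n_res)

    def _update(self, x, u):
        pre = np.tanh(self.Win @ np.concatenate(([1.0], u)) + self.W @ x)
        return (1 - self.leak) * x + self.leak * pre   # leaky integration

    def fit(self, U, Y, washout=300):
        """U, Y: arrays of shape (time, n_in) and (time, n_out); washout = initialization steps."""
        x = np.zeros(self.W.shape[0])
        collected = []
        for t in range(U.shape[0]):
            x = self._update(x, U[t])
            if t >= washout:
                collected.append(np.concatenate(([1.0], U[t], x)))
        X = np.array(collected).T                      # design matrix
        Yt = Y[washout:].T
        # Ridge regression: Wout = Y X^T (X X^T + reg I)^-1
        self.Wout = Yt @ X.T @ np.linalg.inv(X @ X.T + self.reg * np.eye(X.shape[0]))
        self.x = x                                     # keep the state for subsequent prediction

    def predict(self, u):
        """One time-step prediction given the current input u."""
        self.x = self._update(self.x, u)
        return self.Wout @ np.concatenate(([1.0], u, self.x))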
V. EXPERIMENTAL SETUP

The research has been accomplished in the manner that can be seen in Fig. 6. First of all, music was accumulated in the .mid format (hex code). As stated before, it was processed by the Mido library and stored in a .csv format in the form of messages that carry the information of the notes: the pitch number, the on and off ticks and the length.

Then the messages were read from the .csv file and quantized. Quantization was performed for the beginning and the end of the notes in the following way. A quantization unit of 60 ticks (representing a 32nd note) was chosen. Next, if the residual value of the tick was less than half the quantization unit, the tick was reduced by the residual. If the residual value was equal to or higher than half of the quantization unit, i.e. 30 ticks, the tick was increased by the difference between the quantization unit and the residual. The lengths of the notes were recalculated afterwards.

In Fig. 5 we can see the distribution of the notes after quantization. Hereby, the number of notes of the length of the quant (60 ticks) has increased. The most frequent note length stayed the same (120 ticks). Also, a tiny part of the very shortest notes was quantized to zero length and thus eliminated.

Fig. 5. Lengths of quantized notes, where the quantization unit is 60 ticks.
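The quantization rule described above amounts to rounding every tick to the nearest multiple of the 60-tick unit (rounding up on ties). A small hedged sketch of such a routine, as we imagine it rather than the exact code used:

QUANT = 60   # quantization unit in ticks: a 32nd note when a quarter note is 480 ticks

def quantize_tick(tick, unit=QUANT):
    """Round a tick down or up to the nearest multiple of the quantization unit."""
    residual = tick % unit
    if residual < unit / 2:
        return tick - residual            # residual below half the unit: round down
    return tick + (unit - residual)       # residual of 30 ticks or more: round up

def quantize_message(pitch, on_tick, off_tick):
    """Quantize a note message and recalculate its length; zero-length notes are dropped."""
    on_q, off_q = quantize_tick(on_tick), quantize_tick(off_tick)
    length = off_q - on_q
    return (pitch, on_q, off_q, length) if length > 0 else None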
As a further step, these quantized music messages were turned into a state matrix whose length is equal to the total length of the pieces divided by the quantization unit, rounded to an integer. The other dimension of the state matrix is the note pitch, 128 values in total. The value at each time step for a certain note represents its state (1 for pressed and 0 for not pressed). 80% of the data were sent to the echo state network whilst 20% were used for validation of the model, thus finding out the error. The error was calculated in the form of the root mean squared error (RMSE).
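A hedged sketch of this step, with illustrative names and the CSV message layout from Section II assumed: the quantized messages are rolled out into a binary time-step-by-128 state matrix, split 80/20 along time, and the prediction error is measured as RMSE.

import numpy as np

QUANT = 60   # ticks per time step of the state matrix

def build_state_matrix(messages, total_ticks):
    """messages: iterable of quantized (pitch, on_tick, off_tick, length) tuples."""
    steps = round(total_ticks / QUANT)
    states = np.zeros((steps, 128))                       # 1 = pressed, 0 = not pressed
    for pitch, on_tick, off_tick, _ in messages:
        states[on_tick // QUANT:off_tick // QUANT, pitch] = 1.0
    return states

def split_train_test(states, train_fraction=0.8):
    cut = int(len(states) * train_fraction)
    return states[:cut], states[cut:]

def rmse(predicted, target):
    return np.sqrt(np.mean((predicted - target) ** 2))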
An ESN was generated according to the given parameters. This ESN was then trained on the input and predicted music based on its learned weights as a one time-step prediction. The training process was initialized with 300 time steps, that is with 300 quants (of 60 ticks). To find the best parameters for our echo state network, we would repeat the procedure of generating the network according to different parameters and training the new network model on the very same data. Then we predicted the next notes based on the newly gained weights and found the error by comparing with the original Mozart data.

The prediction of notes was a sequel of the training process. To be more precise, the model predicted notes as a one time-step prediction.

Summarizing, a grid search analysis over 4 parameters of the echo state network has been performed. The parameters that have been investigated for tuning our network are the following: leaking rate, input scaling, spectral radius and regularization, which are the most important ESN parameters, explained in Section IV. Since their ranges usually go from 0 to 1, 0 to 2, 0 to 2 and almost anything respectively, they have been tested for values in these ranges. An exhaustive grid search has been performed looking for the best parameters.

In addition to RMSE, the mean and standard deviation were calculated. The original Mozart music had a mean of 0.04238 and a standard deviation of 0.0779. The mean represents the probability of a note being played at each time step within the note spectrum. In Mozart's case the note spectrum spans from the 29th to the 91st note. The standard deviation represents the mean of the standard deviations of the notes in the note spectrum.

The leaking rate has been tested from 0.0025 to 1, the spectral radius varied from 0.0015 to 2 in this test, the input scaling from 2×10⁻⁶ to 2 and the regularization from 10⁻⁶ to 10⁵.

Fig. 6. Scheme of the research. Music is processed from the .mid format into the .csv format, then quantized and transformed into a state matrix. 80% of the data are fed to the network while 20% are compared to the data predicted by the trained network model generated with the given parameters. Lastly, the errors for the given parameters are printed out.
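A hedged sketch of such a grid search loop, reusing the illustrative ESN class and helpers (train, test, rmse) from the earlier sketches; the grids below are abbreviated, whereas the actual sweep covers the full ranges quoted above:

import itertools
import numpy as np

# Abbreviated grids; the full sweep covers the ranges quoted above.
leak_grid = [0.01, 0.025, 0.1, 0.25]
in_scale_grid = [2e-4, 5e-4, 2e-3, 2e-1]
rho_grid = [0.01, 0.1, 0.8, 1.4]
reg_grid = [1e-4, 1e-3, 1e-1, 1e1]

results = []
for leak, in_scale, rho, reg in itertools.product(leak_grid, in_scale_grid,
                                                  rho_grid, reg_grid):
    esn = ESN(n_in=128, n_out=128, n_res=1000,
              leak=leak, in_scale=in_scale, rho=rho, reg=reg)
    esn.fit(train[:-1], train[1:], washout=300)        # one-step-ahead targets
    preds = np.array([esn.predict(u) for u in test[:-1]])
    err = rmse(preds, test[1:])
    results.append((leak, in_scale, rho, reg,
                    preds.mean(), err, preds.std(axis=0).mean()))

results.sort(key=lambda row: row[5])                    # sort by RMSE, best first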
VI. RESULTS

As we can see from Table II, sorted by error (top 10), the lowest value of the error (RMSE) is a tiny bit above 0.0307. It is clear that the best leaking rate for our model is about 0.025, while the combination of input scaling and spectral radius varies a little: the input scaling goes from 0.002 to 0.0002 and the spectral radius from 0.01 to 0.1. We can notice that while the RMSE is the lowest, the mean of the notes is about the same as the mean of the quantized original Mozart data, but the standard deviation is quite different. In the error tables (Table II, Table III, Table IV), reg stands for regularization, rmse for RMSE and std for standard deviation.

TABLE II. SORTED ERROR DATA (TOP 10)

leaking rate | input scaling | spectral radius | reg    | mean   | rmse     | std
0.025        | 0.0002        | 0.1             | 0.0001 | 0.0410 | 0.030769 | 0.05696
0.025        | 0.0005        | 0.01            | 0.001  | 0.0411 | 0.030771 | 0.05699
0.025        | 0.0005        | 0.01            | 0.001  | 0.0411 | 0.030771 | 0.05698
0.025        | 0.0002        | 0.02            | 0.0001 | 0.041  | 0.030772 | 0.05697
0.03         | 0.0005        | 0.1             | 0.001  | 0.0411 | 0.030772 | 0.05699
0.025        | 0.0005        | 0.05            | 0.001  | 0.0411 | 0.030773 | 0.05699
0.03         | 0.0005        | 0.06            | 0.001  | 0.0411 | 0.030774 | 0.05699
0.02         | 0.0006        | 0.02            | 0.001  | 0.0411 | 0.030775 | 0.05697
0.015        | 0.0006        | 0.02            | 0.001  | 0.0411 | 0.030776 | 0.05697
0.02125      | 0.002         | 0.1             | 0.01   | 0.0410 | 0.030776 | 0.05697

Having a low leaking rate suggests that the state has a lot of inertia and the change of the state is slow. The input scaling scales the Win matrix; thus the input weights are very low and the model depends on its input just a tiny bit. Since it is lower than the spectral radius, the model has a lot of memory, i.e. it follows a long-term dependency. Having a low spectral radius as well tells us that the models are almost linear. To summarize, the prediction function is not very complex and the model has a lot of memory.

From Table III we see that a high regularization value gives us huge errors. It has to be noted that for this particular grid search step, the maximum value of the input scaling and the spectral radius was 0.2. Thus, we can also deduce that high input scaling values lead to higher errors. Though the leaking rate is not as important, we can still see that some of its higher values lead to higher errors. High regularization significantly reduces the mean value and the standard deviation of the notes.

TABLE III. SORTED ERROR DATA (WORST 10)

leaking rate | input scaling | spectral radius | reg    | mean   | rmse     | std
0.175        | 0.2           | 0.2             | 100000 | 0.0345 | 0.064397 | 0.00959
0.175        | 0.2           | 0.14            | 100000 | 0.0357 | 0.064656 | 0.00952
0.1          | 0.2           | 0.02            | 100000 | 0.035  | 0.064968 | 0.00801
0.1          | 0.2           | 0.2             | 100000 | 0.0352 | 0.065097 | 0.00817
0.1          | 0.2           | 0.14            | 100000 | 0.0353 | 0.065158 | 0.00806
0.1          | 0.2           | 0.08            | 100000 | 0.0354 | 0.065258 | 0.00808
0.025        | 0.2           | 0.02            | 100000 | 0.035  | 0.066049 | 0.00572
0.025        | 0.2           | 0.14            | 100000 | 0.0352 | 0.0662   | 0.00584
0.025        | 0.2           | 0.08            | 100000 | 0.0353 | 0.066243 | 0.00585
0.025        | 0.2           | 0.2             | 100000 | 0.0354 | 0.066295 | 0.00582

In order to see tendencies beyond regularization, we filtered the data for regularization below or equal to 10. This brought us back to the maximum values of input scaling, spectral radius and leaking rate.

In Table IV we see that high input scaling produces a high error once again. Interestingly, the leaking rate stays at 0.25 for the highest errors. Although the spectral radius stays quite high, it is not of the highest value for the highest error. The mean is almost the same as with the best results. The standard deviation is higher in this case than with the best results. It is even closer to the standard deviation of the quantized Mozart music than the one obtained with the best results.

TABLE IV. SORTED ERROR DATA (WORST 10) WHILE REGULARIZATION IS SMALLER THAN OR EQUAL TO 10

leaking rate | input scaling | spectral radius | reg  | mean   | rmse     | std
0.25         | 2             | 0.8             | 1    | 0.0405 | 0.043065 | 0.06195
0.25         | 2             | 0.8             | 0.1  | 0.0405 | 0.043094 | 0.06198
0.25         | 2             | 1.4             | 0.01 | 0.0405 | 0.043098 | 0.06199
0.25         | 2             | 1.4             | 1    | 0.0406 | 0.043128 | 0.06173
0.25         | 2             | 1.4             | 0.1  | 0.0406 | 0.043139 | 0.06175
0.25         | 2             | 1.4             | 0.01 | 0.0406 | 0.043142 | 0.06175
0.25         | 2             | 1.4             | 10   | 0.0406 | 0.043169 | 0.06171
0.25         | 0.8           | 1.4             | 1    | 0.0413 | 0.043178 | 0.06141
0.25         | 0.8           | 1.4             | 0.1  | 0.0413 | 0.043234 | 0.06145
0.25         | 0.8           | 1.4             | 0.01 | 0.0413 | 0.04324  | 0.0615

Fig. 7 shows the minimum error dependency on the leaking rate. It is worth noting that although a leaking rate of 0.25 yields the worst results when regularization is not high, it may also yield very good results with other values of the ESN parameters, as can be seen in Fig. 7.

Fig. 7. Minimum RMSE dependency on leaking rate.

The errors were grouped by leaking rate and the minimum value of the error was taken to plot the dependency graph. In Fig. 8 we can see the most promising region of the leaking rate for our echo state network.

Fig. 8. Zoomed minimum RMSE dependency on leaking rate.
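A hedged sketch of how such a grouped dependency plot could be produced from the grid search results collected above; pandas and matplotlib are our illustrative choice here, not necessarily the tools used by the authors:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(results, columns=['leaking_rate', 'input_scaling', 'spectral_radius',
                                    'reg', 'mean', 'rmse', 'std'])

# Group by leaking rate and keep the lowest RMSE reached for each value.
min_by_leak = df.groupby('leaking_rate')['rmse'].min()
min_by_leak.plot(marker='o')
plt.xlabel('leaking rate')
plt.ylabel('minimum RMSE')
plt.show()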
Fig. 9 shows the minimum error dependency on the input scaling, whereas Fig. 10 zooms in on the most promising region of the input scaling. The best values of the input scaling are 0.0002 and 0.0005. Going even lower, the error values increase dramatically.

Fig. 9. Minimum RMSE dependency on input scaling.

Fig. 10. Zoomed minimum RMSE dependency on input scaling.

Fig. 11 shows the minimum error dependency on the spectral radius. From Fig. 12 and Fig. 13 we can see that the minimum RMSE stabilizes and reaches its minimum for spectral radius values between 0.01 and 0.1, growing again outside this region.

Fig. 11. Minimum RMSE dependency on spectral radius.

Fig. 12. Zoomed minimum RMSE dependency on spectral radius.

Fig. 13. Minimum RMSE dependency on spectral radius, zoomed to the most promising region.

Fig. 14 and Fig. 15 imply that the best regularization values are of the order of 10⁻⁴ to 10⁻².

Fig. 14. Minimum RMSE dependency on regularization.

Fig. 15. Zoomed minimum RMSE dependency on regularization.

Since it was a lot easier to find the optimal leaking rate than the input scaling and spectral radius, we grouped the errors by input scaling and spectral radius, taking the minimum RMSE value, in Fig. 16. Regularization is an additional parameter that prevents overfitting and we have not grouped by it; it was quite easy to find as well.

Fig. 16. Grid search minimum error (RMSE) grouped by input scaling and spectral radius.

As it was found that the most optimal leaking rate is 0.025, the errors were grouped by input scaling and spectral radius once again, this time with the leaking rate set to 0.025. In Fig. 17 we can see pointy triangles in the lower region where the errors are the lowest.

Fig. 17. Grid search minimum error (RMSE) grouped by input scaling and spectral radius while the leaking rate equals 0.025.

It has to be taken into account that producing an even finer grid might give us even better results, but this takes time. Also, it seems from all this analysed data that the reduction in error would be quite low.

VII. CONCLUSIONS

We can affirm that the best value of the leaking rate in our research proved to be 0.025. The best values of the input scaling are 0.0005 and 0.0002, whereas the most optimal values of the spectral radius and regularization vary from 0.1 to 0.01 and from 10⁻⁴ to 10⁻² respectively. Having said that, these values will not produce the best results in separation; they only produce the best results in a proper combination with the other variables, as can be seen in the tables and figures.

To summarize our research, we can state that to predict Mozart's music, one has to memorize a lot of the notes in order to predict the next note. In terms of our echo state network, it has to follow a long-term dependency because the input scaling is lower than the spectral radius. Having a low spectral radius as well implies that the prediction function ought to be quite simple, because the reservoir operates in an almost linear regime.

VIII. FUTURE WORK

Our main aim is to produce good music that people would like to listen to. To achieve this goal, we analysed the best models to replicate Mozart's music. Lately, we have been planning to include information about the piano hands in our composition model. In future work we would like to expand the dimensions of this research, since MIDI files carry additional information such as the velocity of the pressed note as well as tempo changes and sound perturbations. We are also eager to expand this study to more great composers and then tune our models to not only imitate but also generate new music that people would value. If echo state networks do not prove to be deep enough, we are determined to broaden our research to include deep learning models such as hierarchies of regular recurrent neural networks or long short-term memory networks and other recurrent types. We could then compare them and possibly combine the best parts of them. We are hoping that the artificial network is able to learn the rules or tendencies of music theory implicitly, at least partially. If this is not the case, we could augment it with heuristics.

REFERENCES

[1] MIDI history: chapter 1 – 850 AD to 1850 AD, https://www.midi.org/articles/midi-history-chapter-1, accessed March 2018.
[2] D. Cope, Experiments in Musical Intelligence. A-R Editions, Inc., 1996.
[3] Z. Sun, et al., "Composing music with grammar argumented neural networks and note-level encoding," arXiv preprint arXiv:1611.05416v2, 2016.
[4] G. M. Rader, "A method for composing simple traditional music by computer," Communications of the ACM, vol. 17, no. 11, pp. 631–638, 1974.
[5] J. D. Fernandez and F. Vico, "AI methods in algorithmic composition: a comprehensive survey," Journal of Artificial Intelligence Research, vol. 48, no. 48, pp. 513–582, 2013.
[6] K. Thywissen, "Genotator: an environment for exploring the application of evolutionary techniques in computer-assisted composition," Organised Sound, vol. 4, no. 2, pp. 127–133, 1999.
[7] D. Cope, "Computer modeling of musical intelligence in EMI," Computer Music Journal, vol. 16, no. 16, pp. 69–87, 1992.
[8] M. Allan, "Harmonising chorales in the style of Johann Sebastian Bach," Master's Thesis, School of Informatics, University of Edinburgh, 2002.
[9] D. Silver, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[10] H. Chu, R. Urtasun, and S. Fidler, "Song from PI: a musically plausible network for pop music generation," 2016, https://arxiv.org/abs/1611.03477, accessed March 2018.
[11] A. Huang and R. Wu, "Deep learning for music," 2016, https://arxiv.org/abs/1606.04930, accessed March 2018.
[12] Mido – MIDI objects for Python, https://mido.readthedocs.io, accessed March 2018.
[13] H. Jaeger, "Echo state network," Scholarpedia, http://www.scholarpedia.org/article/Echo_state_network, accessed March 2018.
[14] M. Lukoševičius, "A practical guide to applying echo state networks," Neural Networks: Tricks of the Trade, 2nd ed., Springer, 2012.
[15] Sample echo state network source codes, http://minds.jacobs-university.de/mantas/code, accessed March 2018.