Automatically Estimating Emotion in Music with Deep Long-Short Term Memory Recurrent Neural Networks

Eduardo Coutinho1,2, George Trigeorgis1, Stefanos Zafeiriou1, Björn Schuller1
1 Department of Computing, Imperial College London, London, United Kingdom
2 School of Music, University of Liverpool, Liverpool, United Kingdom
{e.coutinho, g.trigeorgis, s.zafeiriou, bjoern.schuller}@imperial.ac.uk

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.

ABSTRACT

In this paper we describe our approach for the MediaEval "Emotion in Music" task. Our method consists of deep Long-Short Term Memory Recurrent Neural Networks (LSTM-RNN) for dynamic Arousal and Valence regression, using acoustic and psychoacoustic features extracted from the songs that have previously been proven effective for emotion prediction in music. Results on the challenge test set demonstrate an excellent performance for Arousal estimation (r = 0.613 ± 0.278), but not for Valence (r = 0.026 ± 0.500). Issues regarding the reliability and distribution of the test set annotations are indicated as plausible justifications for these results. By using a subset of the development set that was left out for performance estimation, we could determine that the performance of our approach may be underestimated for Valence (Arousal: r = 0.596 ± 0.386; Valence: r = 0.458 ± 0.551).

1. INTRODUCTION

The MediaEval 2015 "Emotion in Music" task comprises three subtasks with the goal of finding the best combination of methods and features for the time-continuous estimation of Arousal and Valence: Subtask 1 - evaluating the best feature sets for the time-continuous prediction of emotion in music; Subtask 2 - evaluating the best regression approaches using a fixed feature set provided by the organisers; Subtask 3 - evaluating the best overall approaches (the choice of features and regressor is free). The development set consists of a subset of 431 songs used in last year's competition (a total of 1 263) [1]. These pieces were selected for being annotated by at least 5 raters and yielding good agreement levels (Cronbach's alpha ≥ 0.6). The test set comprises 58 new songs taken from freely available databases. Unlike the development set, which includes only 45-second excerpts of the original songs, the songs in the test set are complete. For full details on the challenge tasks and database, please refer to [2].

2. METHODOLOGY

Feature sets. We used two feature sets in our experiments, both of which were used in the first and last authors' submissions to last year's challenge [5]. The first feature set (FS1) is used this year by the organisers as the baseline set. It consists of the official set of 65 low-level acoustic descriptors (LLDs) from the 2013 INTERSPEECH Computational Paralinguistics Challenge (ComParE; [15]), plus their first order derivatives (130 LLDs in total). The mean and standard deviation functionals of each LLD over 1 s time windows with 50 % overlap (step size of 0.5 s) are also calculated in order to adapt the LLDs to the challenge requirements. This results in 260 features exported at a rate of 2 Hz. All features were extracted using openSMILE [8]. The second feature set (FS2) consists of the same acoustic features included in FS1, plus four features – Sensory Dissonance (SDiss), Roughness (R), Tempo (T), and Event Density (ED). These features correspond to two psychoacoustic dimensions strongly associated with the communication of emotion in music and speech (e.g., [4]): SDiss and R are instances of Roughness, whereas T and ED are indicators of the pace of music (Duration measures). The four features were extracted with the MIR Toolbox [10]. For estimating SDiss we used Sethares' formula ([12]), and for R Vassilakis' algorithm ([13]). For the Duration-related features, we used the mirtempo and mireventdensity functions to estimate, respectively, T and ED. T is measured in beats-per-minute (BPM) and ED as the number of note onsets per second. FS2 was submitted as the mandatory run for Subtask 1.
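As an illustration of the functional computation described above, the following Python sketch derives mean and standard deviation functionals over 1 s windows with a 0.5 s step from a matrix of LLD frames. It is not the openSMILE configuration actually used; the function name and the LLD frame rate are assumptions made only for the example.

import numpy as np

def llds_to_functionals(llds, lld_rate=100, win_s=1.0, step_s=0.5):
    """Reduce a (n_frames x n_llds) matrix of low-level descriptors to
    mean and standard deviation functionals over sliding windows.

    llds     : array of shape (n_frames, n_llds), e.g. 130 LLDs
    lld_rate : assumed LLD frame rate in Hz (illustrative value)
    win_s    : window length in seconds (1 s in the paper)
    step_s   : window step in seconds (0.5 s -> 2 Hz output)
    """
    win = int(win_s * lld_rate)
    step = int(step_s * lld_rate)
    feats = []
    for start in range(0, llds.shape[0] - win + 1, step):
        window = llds[start:start + win]
        # concatenate per-window mean and standard deviation of every LLD
        feats.append(np.concatenate([window.mean(axis=0),
                                     window.std(axis=0)]))
    return np.asarray(feats)  # (n_windows, 2 * n_llds), i.e. 260 for 130 LLDs

# Example: a 45 s excerpt with 130 LLDs at 100 Hz -> 260 features at 2 Hz
x = np.random.randn(4500, 130)
print(llds_to_functionals(x).shape)  # (89, 260)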
Regressor: LSTM-RNN. Similarly to the first and last authors' approach to last year's edition of this challenge [5], and given the importance of the temporal context in emotional responses to music (e.g., [4]), we considered the use of deep LSTM-RNN [9] as regressors. An LSTM-RNN network is similar to an RNN except that the nonlinear hidden units are replaced by a special kind of memory block which overcomes the vanishing gradient problem of RNNs. Each memory block comprises one or more self-connected memory cells and three multiplicative units – input, output, and forget gates – which provide the cells with analogues of write, read, and reset operations. The multiplicative gates allow LSTM memory cells to store and access information over long sequences (and corresponding periods of time) and to learn a weighting profile of the contribution of other moments in time to a decision at a specific moment in time. LSTM-RNN have previously been used in the context of time-continuous prediction of emotion in music (e.g., [5, 3, 16]).
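For reference, the gating mechanism described above corresponds to the standard LSTM formulation with forget gates [9]; written here without peephole connections (the variant used in our experiments may differ in such minor details), a memory block computes at each time step t:

\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i),\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
h_t &= o_t \odot \tanh(c_t),
\end{align}

where i_t, f_t, and o_t are the input, forget, and output gate activations, c_t the cell state, h_t the block output, \sigma the logistic function, and \odot element-wise multiplication.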
Models training. We used a multi-task learning framework for the joint learning of Arousal and Valence time-continuous values. The development set was divided into 11 folds using a modulus-based scheme. A 10-fold cross-validation procedure was used in the development phase for parameter optimisation, and the extra fold was used to estimate the performance of our optimised model on the official test set. Our basic architecture consisted of deep LSTM-RNN with 3 hidden layers. Given that unsupervised pre-training of models has been demonstrated empirically to help convergence speed and to guide the learning process towards basins of attraction of minima that lead to better generalisation [7], we pre-trained the first hidden layer of the model. We used an unsupervised pre-training strategy consisting of de-noising LSTM-RNN auto-encoders (DAE, [14]). We first created an LSTM-RNN with a single hidden layer trained to predict the input features (y(t) = x(t)). In order to avoid over-fitting, in each training epoch and timestep t we added a noise vector n to x(t), sampled from a Gaussian distribution with zero mean and standard deviation σ. The development and test sets from last year's challenge were used to train the DAE. After determining the auto-encoder weights, the second and third hidden layers were added (and the output layer replaced by the regression variables). The number of memory blocks in each hidden layer (including the pre-trained layer), the learning rate (LR), and the standard deviation of the Gaussian noise applied to the input activations (σ; used to alleviate the effects of over-fitting when pre-training the first layer) were sequentially optimised (a momentum of 0.9 was used for all tests). An early stopping strategy was also used to further avoid overfitting the training data – training was stopped after 20 iterations without improvement of the performance (sum of squared errors) on the validation set. The instances in the training set of each fold were presented in random order to the model. Both the input and output data were standardised to the mean and standard deviation of the training sets in each fold. We computed 5 trials of the same model, each with randomised initial weights in the range [-0.1, 0.1].
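The following sketch illustrates this two-stage procedure: a single-layer de-noising LSTM auto-encoder whose weights then initialise the first layer of a three-layer multi-task regressor. It is an illustrative Keras implementation rather than the toolkit we actually used; the layer sizes, noise level, learning rate, and momentum correspond to values reported elsewhere in this paper, while all variable names, the loss, and the data handling are assumptions made for the example.

import numpy as np
from tensorflow.keras import Model, Input, optimizers, callbacks
from tensorflow.keras.layers import LSTM, GaussianNoise, TimeDistributed, Dense

n_feat = 264              # FS2: 260 ComParE functionals + 4 psychoacoustic features
n_hidden = (200, 150, 25) # memory blocks per hidden layer

# --- Stage 1: de-noising LSTM auto-encoder for the first hidden layer ---
inp = Input(shape=(None, n_feat))
noisy = GaussianNoise(0.3)(inp)              # Gaussian input noise, active during training only
h1 = LSTM(n_hidden[0], return_sequences=True, name="h1")(noisy)
recon = TimeDistributed(Dense(n_feat))(h1)   # reconstruct the clean input: y(t) = x(t)
dae = Model(inp, recon)
dae.compile(optimizer=optimizers.SGD(learning_rate=5e-6, momentum=0.9), loss="mse")
# dae.fit(X_unlabelled, X_unlabelled, ...)   # trained on last year's development and test sets

# --- Stage 2: three-layer multi-task regressor (joint Arousal + Valence) ---
reg_in = Input(shape=(None, n_feat))
r1 = LSTM(n_hidden[0], return_sequences=True, name="h1_reg")(reg_in)
r2 = LSTM(n_hidden[1], return_sequences=True)(r1)
r3 = LSTM(n_hidden[2], return_sequences=True)(r2)
out = TimeDistributed(Dense(2))(r3)          # Arousal and Valence at every 2 Hz frame
regressor = Model(reg_in, out)

# copy the pre-trained DAE weights into the first recurrent layer
regressor.get_layer("h1_reg").set_weights(dae.get_layer("h1").get_weights())

regressor.compile(optimizer=optimizers.SGD(learning_rate=5e-6, momentum=0.9), loss="mse")
stopper = callbacks.EarlyStopping(monitor="val_loss", patience=20)  # early stopping as above
# regressor.fit(X_train, Y_train, validation_data=(X_val, Y_val),
#               shuffle=True, callbacks=[stopper])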
Runs. We submitted four runs for the whole challenge. The specifics of each run are as follows. Run 1 consisted of the predictions of our model using the baseline features (FS1); the submitted predictions were the average over a number of LSTM-RNN outputs selected from all folds and trials, with the folds and trials chosen by minimising the root mean squared error (RMSE) on the small test set created to estimate the predictive power of our models before submission. Run 2 was similar to Run 1, but using FS2. Run 3 was similar to Run 1, except that the folds and trials were selected by maximising the Concordance Correlation Coefficient (CCC) [11], which is a combined measure of precision (like RMSE) and similarity (like Pearson's linear correlation coefficient r).
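For completeness, Lin's CCC [11] between a prediction sequence y and the gold standard x is

\rho_c = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},

where \rho is Pearson's correlation coefficient and \mu and \sigma^2 denote the means and variances of the two sequences; unlike r alone, it also penalises differences in bias and scale between predictions and annotations.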
Table 1: Results on the official test set (a) and the team's test set (b). CB: challenge baseline; me14: best team results in the 2014 challenge.

(a) Official test set
          Run   Arousal        Valence
  RMSE    2     0.242±0.116    0.373±0.195
          3     0.234±0.114    0.372±0.190
          4     0.236±0.114    0.375±0.191
          CB    0.270±0.110    0.366±0.180
  r       2     0.611±0.254    0.004±0.505
          3     0.599±0.287    0.017±0.492
          4     0.613±0.278    0.026±0.500
          CB    0.360±0.260    0.010±0.380

(b) Team's test set
  RMSE    2     0.206±0.128    0.212±0.116
          3     0.221±0.119    0.185±0.119
          4     0.220±0.121    0.183±0.110
          me14  0.102±0.052    0.079±0.048
  r       2     0.532±0.421    0.394±0.509
          3     0.596±0.386    0.458±0.551
          4     0.591±0.386    0.456±0.543
          me14  0.354±0.455    0.198±0.492

3. RESULTS AND CONCLUSIONS

In Table 1 we report the official challenge metrics (r and RMSE), calculated individually for each music piece and averaged across all pieces (standard deviations are also given), on the challenge's official test set (a) and the team's test set (b). The analysis of the results obtained this year indicates that all our runs performed better than the baseline for Arousal. On the official test set, runs 3 and 4 led to the lowest RMSE (0.234 and 0.236, respectively) and runs 2 and 4 to the highest r (0.613). Thus, Run 3 led to the best compromise between both measures. This run consists of the average outputs of two LSTM-RNNs with three layers (200+150+25) and FS2 as input. The model's hyper-parameters and the number of networks used to estimate Arousal and Valence were optimised using the CCC (LR = 5 × 10^-6, noise σ = 0.3). In relation to Valence, our models perform below the baseline on the official test set (see Table 1 a)). One possible reason for this may be the low quality of the Valence annotations obtained for the test set this year (the average Cronbach's α [6] across all test pieces for Valence is 0.29). In Table 1 b) we show the performance estimated on another test set, consisting of a subset of the development set that was left out exclusively to estimate the performance of the runs submitted to the challenge. As can be seen, Valence predictions yield much better results, while the Arousal performance on the team test set is comparable to the one reached with the official test set. Furthermore, in terms of r (RMSE cannot be compared), these results are noticeably better than the best results in last year's challenge (me14), which can be due to the use of more reliable targets during training or the extra hidden layer added to the model. Another possibility is that our models over-fitted the development data in aspects that are not directly visible. According to the organisers, the development set annotations yield a high correlation between Arousal and Valence, whereas the test set does not. It could thus be that the models are picking up on this particularity of the development set, which gives unwanted effects for new music where Arousal and Valence are not correlated. In future studies, apart from verifying this possibility, we will further investigate optimal pre-training strategies for deep LSTM-RNNs. Further, beyond the expert-given feature sets employed here, we will consider opportunities for end-to-end deep learning strategies.

4. ACKNOWLEDGEMENTS

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 645378 (ARIA-VALUSPA).

5. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2014. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[3] E. Coutinho, J. Deng, and B. Schuller. Transfer learning emotion manifestation across music and speech. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 3592–3598, Beijing, China, 2014.
[4] E. Coutinho and N. Dibben. Psychoacoustic cues to emotion in speech prosody and music. Cognition & Emotion, 27(4):658–684, 2013.
[5] E. Coutinho, F. Weninger, B. Schuller, and K. R. Scherer. The Munich LSTM-RNN approach to the MediaEval 2014 "Emotion in Music" task. In Working Notes Proceedings of the MediaEval 2014 Workshop, pages 5–6, Wurzen, Germany, 2014.
[6] L. J. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334, 1951.
[7] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660, 2010.
[8] F. Eyben, F. Weninger, F. Groß, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, pages 835–838, Barcelona, Spain, October 2013.
[9] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
[10] O. Lartillot and P. Toiviainen. A Matlab toolbox for musical feature extraction from audio. In International Conference on Digital Audio Effects, pages 237–244, 2007.
[11] L. I.-K. Lin. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1):255–268, 1989.
[12] W. A. Sethares. Tuning, Timbre, Spectrum, Scale, volume 2. Springer, 2005.
[13] P. Vassilakis. Auditory roughness estimation of complex spectra – roughness degrees and dissonance ratings of harmonic intervals revisited. The Journal of the Acoustical Society of America, 110(5):2755–2755, 2001.
[14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML'08), pages 1096–1103. ACM, 2008.
[15] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer. On the acoustics of emotion in audio: what speech, music and sound have in common. Frontiers in Psychology, 4(Article ID 292):1–12, May 2013.
[16] F. J. Weninger, F. Eyben, and B. Schuller. On-line continuous-time music mood regression with deep recurrent neural networks. In Proceedings of the 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5449–5453, Florence, Italy, May 2014. IEEE.