Automatically Estimating Emotion in Music with Deep Long-Short Term Memory Recurrent Neural Networks

Eduardo Coutinho1,2, George Trigeorgis1, Stefanos Zafeiriou1, Björn Schuller1
1 Department of Computing, Imperial College London, London, United Kingdom
2 School of Music, University of Liverpool, Liverpool, United Kingdom
{e.coutinho, g.trigeorgis, s.zafeiriou, bjoern.schuller}@imperial.ac.uk

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.

ABSTRACT

In this paper we describe our approach for the MediaEval "Emotion in Music" task. Our method consists of deep Long-Short Term Memory Recurrent Neural Networks (LSTM-RNN) for dynamic Arousal and Valence regression, using acoustic and psychoacoustic features extracted from the songs that have previously been proven effective for emotion prediction in music. Results on the challenge test set demonstrate an excellent performance for Arousal estimation (r = 0.613 ± 0.278), but not for Valence (r = 0.026 ± 0.500). Issues regarding the reliability and distribution of the test set annotations are indicated as plausible justifications for these results. By using a subset of the development set that was left out for performance estimation, we could determine that the performance of our approach may be underestimated for Valence (Arousal: r = 0.596 ± 0.386; Valence: r = 0.458 ± 0.551).

1. INTRODUCTION

The MediaEval 2015 "Emotion in Music" task comprises three subtasks with the goal of finding the best combination of methods and features for the time-continuous estimation of Arousal and Valence: Subtask 1 - evaluating the best feature sets for the time-continuous prediction of emotion in music; Subtask 2 - evaluating the best regression approaches using a fixed feature set provided by the organisers; Subtask 3 - evaluating the best overall approaches (the choice of features and regressor is free). The development set consists of a subset of 431 songs used in last year's competition (a total of 1 263) [1]. These pieces were selected for being annotated by at least 5 raters and yielding good agreement levels (Cronbach's alpha ≥ 0.6). The test set comprises 58 new songs taken from freely available databases. Unlike the development set, which includes only 45-second excerpts of the original songs, the songs in the test set are complete. For full details on the challenge tasks and database, please refer to [2].

2. METHODOLOGY

Feature sets. We used two feature sets in our experiments, both of which were used in the first and last authors' submissions to last year's challenge [5]. The first feature set (FS1) is used this year by the organisers as the baseline set. It consists of the official set of 65 low-level acoustic descriptors (LLDs) from the 2013 INTERSPEECH Computational Paralinguistics Challenge (ComParE; [15]), plus their first order derivatives (130 LLDs in total). The mean and standard deviation functionals of each LLD over 1 s time windows with 50 % overlap (step size of 0.5 s) are also calculated in order to adapt the LLDs to the challenge requirements. This results in 260 features exported at a rate of 2 Hz. All features were extracted using openSMILE [8]. The second feature set (FS2) consists of the same acoustic features included in FS1, plus four features – Sensory Dissonance (SDiss), Roughness (R), Tempo (T), and Event Density (ED). These features correspond to two psychoacoustic dimensions strongly associated with the communication of emotion in music and speech (e.g., [4]): SDiss and R are instances of Roughness, whereas T and ED are indicators of the pace of music (Duration measures). The four features were extracted with the MIR Toolbox [10]. For estimating SDiss we used Sethares' formula ([12]), and for R Vassilakis' algorithm ([13]). For the Duration-related features, we used the mirtempo and mireventdensity functions to estimate, respectively, T and ED. T is measured in beats-per-minute (BPM) and ED as the number of note onsets per second. FS2 was submitted as the mandatory run for Subtask 1.
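As an illustration of the functional computation described above, the following Python sketch derives mean and standard deviation functionals over 1 s windows with a 0.5 s step from a matrix of LLD frames. It is not the openSMILE configuration actually used; the function name and the LLD frame rate are assumptions made only for the example.

import numpy as np

def llds_to_functionals(llds, lld_rate=100, win_s=1.0, step_s=0.5):
    """Reduce a (n_frames x n_llds) matrix of low-level descriptors to
    mean and standard deviation functionals over sliding windows.

    llds     : array of shape (n_frames, n_llds), e.g. 130 LLDs
    lld_rate : assumed LLD frame rate in Hz (illustrative value)
    win_s    : window length in seconds (1 s in the paper)
    step_s   : window step in seconds (0.5 s -> 2 Hz output)
    """
    win = int(win_s * lld_rate)
    step = int(step_s * lld_rate)
    feats = []
    for start in range(0, llds.shape[0] - win + 1, step):
        window = llds[start:start + win]
        # concatenate per-window mean and standard deviation of every LLD
        feats.append(np.concatenate([window.mean(axis=0),
                                     window.std(axis=0)]))
    return np.asarray(feats)  # (n_windows, 2 * n_llds), i.e. 260 for 130 LLDs

# Example: a 45 s excerpt with 130 LLDs at 100 Hz -> 260 features at 2 Hz
x = np.random.randn(4500, 130)
print(llds_to_functionals(x).shape)  # (89, 260)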
Regressor: LSTM-RNN. Similarly to the first and last authors' approach to last year's edition of this challenge [5], and given the importance of the temporal context in emotional responses to music (e.g., [4]), we considered the use of deep LSTM-RNN [9] as regressors. An LSTM-RNN network is similar to an RNN except that the nonlinear hidden units are replaced by a special kind of memory block which overcomes the vanishing gradient problem of RNNs. Each memory block comprises one or more self-connected memory cells and three multiplicative units – input, output, and forget gates – which provide the cells with analogues of write, read, and reset operations. The multiplicative gates allow LSTM memory cells to store and access information over long sequences (and corresponding periods of time) and to learn a weighting profile of the contribution of other moments in time to a decision at a specific moment in time. LSTM-RNN have previously been used in the context of time-continuous prediction of emotion in music (e.g., [5, 3, 16]).
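For reference, the gating mechanism described above corresponds to the standard LSTM formulation with forget gates [9]; written here without peephole connections (the variant used in our experiments may differ in such minor details), a memory block computes at each time step t:

\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i),\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
h_t &= o_t \odot \tanh(c_t),
\end{align}

where i_t, f_t, and o_t are the input, forget, and output gate activations, c_t the cell state, h_t the block output, \sigma the logistic function, and \odot element-wise multiplication.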
Models training. We used a multi-task learning framework for the joint learning of Arousal and Valence time-continuous values. The development set was divided into 11 folds using a modulus-based scheme. A 10-fold cross-validation procedure was used in the development phase for parameter optimisation, and the extra fold was used to estimate the performance of our optimised model on the official test set. Our basic architecture consisted of deep LSTM-RNN with 3 hidden layers. Given that unsupervised pre-training of models has been demonstrated empirically to help convergence speed and to guide the learning process towards basins of attraction of minima that lead to better generalisation [7], we pre-trained the first hidden layer of the model. We used an unsupervised pre-training strategy consisting of de-noising LSTM-RNN auto-encoders (DAE, [14]). We first created an LSTM-RNN with a single hidden layer trained to predict the input features (y(t) = x(t)). In order to avoid over-fitting, in each training epoch and timestep t we added a noise vector n to x(t), sampled from a Gaussian distribution with zero mean and standard deviation σ. The development and test sets from last year's challenge were used to train the DAE. After determining the auto-encoder weights, the second and third hidden layers were added (and the output layer replaced by the regression variables). The number of memory blocks in each hidden layer (including the pre-trained layer), the learning rate (LR), and the standard deviation of the Gaussian noise applied to the input activations (σ; used to alleviate the effects of over-fitting when pre-training the first layer) were sequentially optimised (a momentum of 0.9 was used for all tests). An early stopping strategy was also used to further avoid overfitting the training data – training was stopped after 20 iterations without improvement of the performance (sum of squared errors) on the validation set. The instances in the training set of each fold were presented in random order to the model. Both the input and output data were standardised to the mean and standard deviation of the training sets in each fold. We computed 5 trials of the same model, each with randomised initial weights in the range [-0.1, 0.1].
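The following sketch illustrates this two-stage procedure: a single-layer de-noising LSTM auto-encoder whose weights then initialise the first layer of a three-layer multi-task regressor. It is an illustrative Keras implementation rather than the toolkit we actually used; the layer sizes, noise level, learning rate, and momentum correspond to values reported elsewhere in this paper, while all variable names, the loss, and the data handling are assumptions made for the example.

import numpy as np
from tensorflow.keras import Model, Input, optimizers, callbacks
from tensorflow.keras.layers import LSTM, GaussianNoise, TimeDistributed, Dense

n_feat = 264              # FS2: 260 ComParE functionals + 4 psychoacoustic features
n_hidden = (200, 150, 25) # memory blocks per hidden layer

# --- Stage 1: de-noising LSTM auto-encoder for the first hidden layer ---
inp = Input(shape=(None, n_feat))
noisy = GaussianNoise(0.3)(inp)              # Gaussian input noise, active during training only
h1 = LSTM(n_hidden[0], return_sequences=True, name="h1")(noisy)
recon = TimeDistributed(Dense(n_feat))(h1)   # reconstruct the clean input: y(t) = x(t)
dae = Model(inp, recon)
dae.compile(optimizer=optimizers.SGD(learning_rate=5e-6, momentum=0.9), loss="mse")
# dae.fit(X_unlabelled, X_unlabelled, ...)   # trained on last year's development and test sets

# --- Stage 2: three-layer multi-task regressor (joint Arousal + Valence) ---
reg_in = Input(shape=(None, n_feat))
r1 = LSTM(n_hidden[0], return_sequences=True, name="h1_reg")(reg_in)
r2 = LSTM(n_hidden[1], return_sequences=True)(r1)
r3 = LSTM(n_hidden[2], return_sequences=True)(r2)
out = TimeDistributed(Dense(2))(r3)          # Arousal and Valence at every 2 Hz frame
regressor = Model(reg_in, out)

# copy the pre-trained DAE weights into the first recurrent layer
regressor.get_layer("h1_reg").set_weights(dae.get_layer("h1").get_weights())

regressor.compile(optimizer=optimizers.SGD(learning_rate=5e-6, momentum=0.9), loss="mse")
stopper = callbacks.EarlyStopping(monitor="val_loss", patience=20)  # early stopping as above
# regressor.fit(X_train, Y_train, validation_data=(X_val, Y_val),
#               shuffle=True, callbacks=[stopper])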
Runs. We submitted four runs for the whole challenge. The specifics of each run are as follows. Run 1 consisted of the predictions of our model using the baseline features (FS1); the submitted predictions were the average over a number of LSTM-RNN outputs selected from all folds and trials, with the folds and trials chosen by minimising the root mean squared error (RMSE) on the small test set created to estimate the predictive power of our models before submission. Run 2 was similar to Run 1, but using FS2. Run 3 was similar to Run 1, except that the folds and trials were selected by maximising the Concordance Correlation Coefficient (CCC) [11], which is a combined measure of precision (like RMSE) and similarity (like Pearson's linear correlation coefficient r).
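For completeness, Lin's CCC [11] between a prediction sequence y and the gold standard x is

\rho_c = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},

where \rho is Pearson's correlation coefficient and \mu and \sigma^2 denote the means and variances of the two sequences; unlike r alone, it also penalises differences in bias and scale between predictions and annotations.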
Table 1: Results on the official test set (a) and the team's test set (b). CB: challenge baseline; me14: best team results in the 2014 challenge.

(a) Official test set
          Run   Arousal        Valence
  RMSE    2     0.242±0.116    0.373±0.195
          3     0.234±0.114    0.372±0.190
          4     0.236±0.114    0.375±0.191
          CB    0.270±0.110    0.366±0.180
  r       2     0.611±0.254    0.004±0.505
          3     0.599±0.287    0.017±0.492
          4     0.613±0.278    0.026±0.500
          CB    0.360±0.260    0.010±0.380

(b) Team's test set
  RMSE    2     0.206±0.128    0.212±0.116
          3     0.221±0.119    0.185±0.119
          4     0.220±0.121    0.183±0.110
          me14  0.102±0.052    0.079±0.048
  r       2     0.532±0.421    0.394±0.509
          3     0.596±0.386    0.458±0.551
          4     0.591±0.386    0.456±0.543
          me14  0.354±0.455    0.198±0.492

3. RESULTS AND CONCLUSIONS

In Table 1 we report the official challenge metrics (r and RMSE), calculated individually for each music piece and averaged across all pieces (standard deviations are also given), on the challenge's official test set (a) and the team's test set (b). The analysis of the results obtained this year indicates that all our runs performed better than the baseline for Arousal. On the official test set, runs 3 and 4 led to the lowest RMSE (0.234 and 0.236, respectively) and runs 2 and 4 to the highest r (0.613). Thus, Run 3 led to the best compromise between both measures. This run consists of the average outputs of two LSTM-RNNs with three layers (200+150+25) and FS2 as input. The model's hyper-parameters and the number of networks used to estimate Arousal and Valence were optimised using the CCC (LR = 5 × 10^-6, noise σ = 0.3). In relation to Valence, our models perform below the baseline on the official test set (see Table 1 a)). One possible reason for this may be the low quality of the Valence annotations obtained for the test set this year (the average Cronbach's α [6] across all test pieces for Valence is 0.29). In Table 1 b) we show the performance estimated on another test set, consisting of a subset of the development set that was left out exclusively to estimate the performance of the runs submitted to the challenge. As can be seen, Valence predictions yield much better results, while the Arousal performance on the team test set is comparable to the one reached with the official test set. Furthermore, in terms of r (RMSE cannot be compared), these results are noticeably better than the best results in last year's challenge (me14), which can be due to the use of more reliable targets during training or the extra hidden layer added to the model. Another possibility is that our models over-fitted the development data in aspects that are not directly visible. According to the organisers, the development set annotations yield a high correlation between Arousal and Valence, whereas the test set does not. It could thus be that the models are picking up on this particularity of the development set, which gives unwanted effects for new music where Arousal and Valence are not correlated. In future studies, apart from verifying this possibility, we will further investigate optimal pre-training strategies for deep LSTM-RNNs. Further, beyond the expert-given feature sets employed here, we will consider opportunities for end-to-end deep learning strategies.

4. ACKNOWLEDGEMENTS

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 645378 (ARIA-VALUSPA).

5. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2014. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[3] E. Coutinho, J. Deng, and B. Schuller. Transfer learning emotion manifestation across music and speech. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 3592–3598, Beijing, China, 2014.
[4] E. Coutinho and N. Dibben. Psychoacoustic cues to emotion in speech prosody and music. Cognition & Emotion, 27(4):658–684, 2013.
[5] E. Coutinho, F. Weninger, B. Schuller, and K. R. Scherer. The Munich LSTM-RNN approach to the MediaEval 2014 "Emotion in Music" task. In Working Notes Proceedings of the MediaEval 2014 Workshop, pages 5–6, Wurzen, Germany, 2014.
[6] L. J. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334, 1951.
[7] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660, 2010.
[8] F. Eyben, F. Weninger, F. Groß, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, pages 835–838, Barcelona, Spain, October 2013.
[9] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
[10] O. Lartillot and P. Toiviainen. A Matlab toolbox for musical feature extraction from audio. In International Conference on Digital Audio Effects, pages 237–244, 2007.
[11] L. I.-K. Lin. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1):255–268, 1989.
[12] W. A. Sethares. Tuning, Timbre, Spectrum, Scale, volume 2. Springer, 2005.
[13] P. Vassilakis. Auditory roughness estimation of complex spectra – roughness degrees and dissonance ratings of harmonic intervals revisited. The Journal of the Acoustical Society of America, 110(5):2755–2755, 2001.
[14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML'08), pages 1096–1103. ACM, 2008.
[15] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer. On the acoustics of emotion in audio: what speech, music and sound have in common. Frontiers in Psychology, 4(Article ID 292):1–12, May 2013.
[16] F. J. Weninger, F. Eyben, and B. Schuller. On-line continuous-time music mood regression with deep recurrent neural networks. In Proceedings of the 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5449–5453, Florence, Italy, May 2014. IEEE.