<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatically Estimating Emotion in Music with Deep Long-Short Term Memory Recurrent Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eduardo Coutinho</string-name>
          <email>e.coutinho@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Trigeorgis</string-name>
          <email>g.trigeorgis@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanos Zafeiriou</string-name>
          <email>s.zafeiriou@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Björn Schuller</string-name>
          <email>bjoern.schuller@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing, Imperial College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Music, University of Liverpool</institution>
          ,
          <addr-line>Liverpool</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we describe our approach to the MediaEval "Emotion in Music" task. Our method consists of deep Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) for dynamic Arousal and Valence regression, using acoustic and psychoacoustic features extracted from the songs that have previously been proven effective for emotion prediction in music. Results on the challenge test set demonstrate an excellent performance for Arousal estimation (r = 0.613 ± 0.278), but not for Valence (r = 0.026 ± 0.500). Issues regarding the reliability and distributions of the test set annotations are indicated as plausible justifications for these results. By using a subset of the development set that was left out for performance estimation, we could determine that the performance of our approach may be underestimated for Valence (Arousal: r = 0.596 ± 0.386; Valence: r = 0.458 ± 0.551).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The MediaEval 2015 "Emotion in Music" task comprises
three subtasks with the goal of finding the best combination
of methods and features for the time-continuous estimation
of Arousal and Valence: Subtask 1 - evaluating the best
feature sets for the time-continuous prediction of emotion
in music; Subtask 2 - evaluating the best regression
approaches using a fixed feature set provided by the
organisers; Subtask 3 - evaluating the best overall approaches
(the choice of features and regressor is free). The
development set consists of a subset of 431 songs used in last
year's competition (a total of 1,263) [1]. These pieces were
selected for being annotated by at least 5 raters and
yielding good agreement levels (Cronbach's alpha ≥ 0.6). The
test set comprises 58 new songs taken from freely available
databases. Unlike the development set, which includes only
45-second excerpts of the original songs, the songs in the
test set are complete. For full details on the challenge tasks
and database, please refer to [2].</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>Feature sets. We used two feature sets in our
experiments, both of which were used in the first and last authors'
submissions to last year's challenge [5]. The first feature
set (FS1) is used this year by the organisers as the baseline
set. It consists of the official set of 65 low-level acoustic
descriptors (LLDs) from the 2013 INTERSPEECH
Computational Paralinguistics Challenge (ComParE; [15]), plus
their first-order derivatives (130 LLDs in total). The mean
and standard deviation functionals of each LLD over 1 s time
windows with 50 % overlap (step size of 0.5 s) are also
calculated in order to adapt the LLDs to the challenge
requirements. This results in 260 features exported at a rate of
2 Hz. All features were extracted using openSMILE ([8]).
The second feature set (FS2) consists of the same acoustic
features included in FS1, plus four features - Sensory
Dissonance (SDiss), Roughness (R), Tempo (T), and Event
Density (ED). These features correspond to two psychoacoustic
dimensions strongly associated with the communication of
emotion in music and speech (e. g., [4]): SDiss and R are
instances of Roughness, whereas T and ED are indicators of
the pace of music (Duration measures). The four features
were extracted with the MIR Toolbox [10]. For estimating
SDiss we used Sethares' formula ([12]), and for R Vassilakis'
algorithm ([13]). For the Duration-related
features, we used the mirtempo and mireventdensity functions
to estimate, respectively, T and ED. T is measured in
beats per minute (BPM) and ED as the number of note onsets
per second. FS2 was submitted as the mandatory run for
Subtask 1.</p>
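The windowed functional extraction described above can be sketched as follows (a minimal illustration; the 100 Hz LLD frame rate is an assumption, since the text only specifies the 1 s window, 0.5 s step, and 2 Hz output rate):

```python
import numpy as np

def windowed_functionals(llds, frame_rate=100, win_s=1.0, step_s=0.5):
    """Mean and standard-deviation functionals of frame-level LLDs over
    sliding windows (1 s windows, 50 % overlap -> 2 Hz output).

    llds: array of shape (n_frames, n_llds)
    Returns an array of shape (n_windows, 2 * n_llds).
    """
    win = int(win_s * frame_rate)
    step = int(step_s * frame_rate)
    out = []
    for start in range(0, llds.shape[0] - win + 1, step):
        chunk = llds[start:start + win]
        out.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.asarray(out)

# 45 s of 130 LLDs at an assumed 100 Hz frame rate -> 260 features at 2 Hz
feats = windowed_functionals(np.random.randn(4500, 130))
print(feats.shape)  # (89, 260)
```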
      <p>Regressor: LSTM-RNN. Similarly to the first and last
authors' approach to last year's edition of this challenge [5],
and given the importance of temporal context in
emotional responses to music (e. g., [4]), we considered the use
of deep LSTM-RNN [9] as regressors. An LSTM-RNN
network is similar to an RNN except that the nonlinear
hidden units are replaced by a special kind of memory block
which overcomes the vanishing gradient problem of RNNs.
Each memory block comprises one or more self-connected
memory cells and three multiplicative units - input, output,
and forget gates - which provide the cells with analogues of
write, read, and reset operations. The multiplicative gates
allow LSTM memory cells to store and access information
over long sequences (and corresponding periods of time) and
to learn a weighting profile of the contribution of other
moments in time to a decision at a specific moment in time.
LSTM-RNN have previously been used in the context of
time-continuous predictions of emotion in music (e.g., [5, 3,
16]).</p>
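A single step of such a memory block can be sketched in NumPy as follows (an illustrative standard LSTM cell, not the authors' exact implementation; all parameter names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM memory block. W, U, b hold the
    stacked parameters of the input (i), forget (f), and output (o)
    gates and the cell candidate (g)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # all four pre-activations, shape (4n,)
    i = sigmoid(z[0:n])               # input gate: analogue of "write"
    f = sigmoid(z[n:2*n])             # forget gate: analogue of "reset"
    o = sigmoid(z[2*n:3*n])           # output gate: analogue of "read"
    g = np.tanh(z[3*n:4*n])           # candidate cell content
    c = f * c_prev + i * g            # additive cell update: this path is
                                      # what mitigates vanishing gradients
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 260, 8
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for t in range(5):                    # unroll over a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape)  # (8,)
```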
      <p>Model training. We used a multi-task learning
framework for the joint learning of Arousal and Valence
time-continuous values. The development set was divided into
11 folds using a modulus-based scheme. A 10-fold
cross-validation procedure was used in the development phase for
parameter optimisation, and the extra fold was used to
estimate the performance of our optimised model on the official
test set. Our basic architecture consisted of deep
LSTM-RNN with 3 hidden layers. Given that unsupervised
pre-training of models has been demonstrated empirically to
speed up convergence, and to guide the learning process
towards basins of attraction of minima that lead to better
generalisation [7], we pre-trained the first hidden layer of
the model. We used an unsupervised pre-training strategy
consisting of de-noising LSTM-RNN auto-encoders (DAE,
[14]). We first created an LSTM-RNN with a single hidden
layer trained to predict the input features (y(t) = x(t)).
In order to avoid over-fitting, in each training epoch and
timestep t, we added a noise vector n to x(t), sampled from
a Gaussian distribution with zero mean and variance σn. The
development and test sets from last year's challenge were
used to train the DAE. After determining the auto-encoder
weights, the second and third hidden layers were added (and
the output layer replaced by the regression variables). The
number of memory blocks in each hidden layer (including the
pre-trained layer), the learning rate (LR), and the standard
deviation of the Gaussian noise applied to the input
activations (σ; used to alleviate the effects of over-fitting when
pre-training the first layer) were sequentially optimised (a
momentum of 0.9 was used for all tests). An early stopping
strategy was also used to further avoid over-fitting the
training data - training was stopped after 20 iterations without
improvement of the performance (sum of squared errors) on
the validation set. The instances in the training set of each
fold were presented in random order to the model. Both
the input and output data were standardised to the mean
and standard deviation of the training sets in each fold. We
computed 5 trials of the same model, each with randomised
initial weights in the range [-0.1, 0.1].</p>
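The denoising pre-training target described above can be sketched as follows (a minimal illustration of building one noisy-input/clean-target pair; the network training itself is omitted, and σ = 0.3 is the value reported later for the submitted run):

```python
import numpy as np

def make_denoising_pair(x, noise_std=0.3, rng=None):
    """Build one denoising-autoencoder training pair: the network sees
    a Gaussian-corrupted input but is trained to reproduce the clean
    sequence, i.e. y(t) = x(t).

    x: (timesteps, n_features) sequence of standardised feature vectors.
    """
    rng = rng or np.random.default_rng()
    noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    return noisy, x  # (corrupted input, clean target)

rng = np.random.default_rng(1)
x = rng.normal(size=(90, 260))        # e.g. 45 s of 260 functionals at 2 Hz
noisy, target = make_denoising_pair(x, noise_std=0.3, rng=rng)
print(noisy.shape == target.shape)  # True
```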
      <p>Runs. We submitted four runs for the whole challenge.
The specifics of each run are as follows: Run 1 consisted
of the predictions of our model using the baseline features
(FS1); the submitted predictions consisted of the average
over a number of LSTM-RNN outputs selected from all folds
and trials. The selected folds and trials were determined by
minimising the root mean squared error (RMSE) on the
small test set created to estimate the predictive power of
our models before submission. Run 2 was similar to Run
1 but using FS2; Run 3 was similar to Run 1, except that
the selected folds and trials were selected by maximising the
Concordance Correlation Coefficient (CCC) [11], which is a
combined measure of precision (like RMSE) and similarity
(like Pearson's linear correlation coefficient r).</p>
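The CCC used for this selection can be sketched as follows (a standard formulation of Lin's coefficient [11], not taken from the authors' code):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Lin's Concordance Correlation Coefficient: rewards similarity in
    shape (like Pearson's r) while penalising mean and scale offsets."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

t = np.linspace(0, 1, 100)
print(ccc(t, t))        # 1.0 for perfect agreement
print(ccc(t, t + 0.5))  # < 1: same shape, but the mean offset is penalised
```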
    </sec>
    <sec id="sec-3">
      <title>3. RESULTS AND CONCLUSIONS</title>
      <p>In Table 1, we report the official challenge metrics (r and
RMSE) calculated individually for each music piece and
averaged across all pieces (standard deviations are also given)
for the challenge's official test set (a) and the team's test
set (b). The analysis of this year's results
indicates that all our runs performed better than the
baseline for Arousal. On the official test set, runs 3 and 4
led to the lowest RMSE (0.234 and 0.236, respectively) and
runs 2 and 4 to the highest r (0.613). Thus, Run 3 led
to the best compromise between both measures. This run
consists of the average outputs of two LSTM-RNNs with
three layers (200+150+25) and FS2 as input. The model's
hyper-parameters and the number of networks used to
estimate Arousal and Valence were optimised using the CCC
(LR = 5 × 10⁻⁶, noise σ = 0.3). In relation to Valence,
our models perform below the baseline on the official test
set (see Table 1 (a)). One possible reason for this may be
the low quality of the Valence annotations obtained for the
test set this year (the average Cronbach's alpha [6] across all
test pieces for Valence is 0.29). In Table 1 (b) we show the
performance estimated on another test set consisting of a
subset of the development set that was left out exclusively
to estimate the performance of the runs submitted to the
challenge. As can be seen, Valence predictions yield much
better results, while the Arousal performance on the team
test set is comparable to the one reached with the official
test set. Furthermore, in terms of r (RMSE cannot be
compared), these results are noticeably better than the best
results in last year's challenge (me14), which can be due to
the use of more reliable targets during training or the extra
hidden layer added to the model. Another possibility is that
our models over-fitted the development data in aspects that
are not directly visible. According to the organisers, the
development set annotations yield a high correlation between
Arousal and Valence, whereas the test set annotations do not. It could
thus be that the models are picking up on this particularity
of the development set, which produces unwanted effects for new
music where Arousal and Valence are not correlated.</p>
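The inter-rater agreement figure quoted above (Cronbach's alpha [6]) can be computed as follows (a generic sketch over hypothetical per-rater annotation columns, not the organisers' exact procedure):

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for internal consistency.

    ratings: array of shape (n_timepoints, n_raters), one column per rater.
    """
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1).sum()   # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)     # variance of the summed scores
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

rng = np.random.default_rng(0)
signal = rng.normal(size=120)
# five raters tracking the same signal with small disagreements -> high alpha
consistent = np.stack([signal + 0.1 * rng.normal(size=120) for _ in range(5)], axis=1)
# five mutually independent raters -> alpha near zero
independent = rng.normal(size=(120, 5))
print(cronbach_alpha(consistent), cronbach_alpha(independent))
```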
      <p>In future studies, apart from verifying this possibility, we
will further investigate optimal pre-training strategies for
deep LSTM-RNNs. Further, beyond the expert-given
feature sets employed here, we will consider opportunities for
end-to-end deep learning strategies.</p>
    </sec>
    <sec id="sec-4">
      <title>4. ACKNOWLEDGEMENTS</title>
      <p>The research leading to these results has received funding
from the European Union's Horizon 2020 research and
innovation programme under grant agreement no. 645378
(ARIA-VALUSPA).</p>
    </sec>
    <sec id="sec-5">
      <title>5. REFERENCES</title>
      <p>[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion
in Music task at MediaEval 2014. In MediaEval 2014
Workshop, Barcelona, Spain, October 16-17 2014.
[2] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion
in Music task at MediaEval 2015. In Working Notes
Proceedings of the MediaEval 2015 Workshop, September 2015.
[3] E. Coutinho, J. Deng, and B. Schuller. Transfer
learning emotion manifestation across music and
speech. In Proceedings of the International Joint
Conference on Neural Networks (IJCNN), pages
3592-3598, Beijing, China, 2014.
[4] E. Coutinho and N. Dibben. Psychoacoustic cues to
emotion in speech prosody and music. Cognition &amp;
Emotion, 27(4):658-684, 2013.
[5] E. Coutinho, F. Weninger, B. Schuller, and K. R.
Scherer. The Munich LSTM-RNN approach to the
MediaEval 2014 "Emotion in Music" task. In Working
Notes Proceedings of the MediaEval 2014 Workshop,
pages 5-6, Wurzen, Germany, 2014.
[6] L. J. Cronbach. Coefficient alpha and the internal
structure of tests. Psychometrika, 16:297-334, 1951.
[7] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol,
P. Vincent, and S. Bengio. Why does unsupervised
pre-training help deep learning? The Journal of
Machine Learning Research, 11:625-660, 2010.
[8] F. Eyben, F. Weninger, F. Gross, and B. Schuller.
Recent Developments in openSMILE, the Munich
Open-Source Multimedia Feature Extractor. In
Proceedings of the 21st ACM International Conference
on Multimedia, MM 2013, pages 835-838, Barcelona,
Spain, October 2013.
[9] F. A. Gers, J. Schmidhuber, and F. Cummins.
Learning to forget: Continual prediction with LSTM.
Neural Computation, 12(10):2451-2471, 2000.
[10] O. Lartillot and P. Toiviainen. A Matlab toolbox for
musical feature extraction from audio. In International
Conference on Digital Audio Effects, pages 237-244,
2007.
[11] L. I.-K. Lin. A concordance correlation coefficient to
evaluate reproducibility. Biometrics, 45(1):255-268,
1989.
[12] W. A. Sethares. Tuning, Timbre, Spectrum, Scale,
volume 2. Springer, 2005.
[13] P. Vassilakis. Auditory roughness estimation of
complex spectra - roughness degrees and dissonance
ratings of harmonic intervals revisited. The Journal of
the Acoustical Society of America, 110(5):2755-2755,
2001.
[14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A.
Manzagol. Extracting and composing robust features
with denoising autoencoders. In Proceedings of the
25th International Conference on Machine Learning
(ICML'08), pages 1096-1103. ACM, 2008.
[15] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro,
and K. R. Scherer. On the Acoustics of Emotion in
Audio: What Speech, Music and Sound have in
Common. Frontiers in Psychology, 4(Article ID
292):1-12, May 2013.
[16] F. J. Weninger, F. Eyben, and B. Schuller. On-Line
Continuous-Time Music Mood Regression with Deep</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          Recurrent Neural Networks. In
          <source>Proceedings of the 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</source>
          , pages
          <fpage>5449</fpage>
          -
          <lpage>5453</lpage>
          , Florence, Italy, May
          <year>2014</year>
          . IEEE.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>