<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Time-continuous Estimation of Emotion in Music with Recurrent Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Pellegrini</string-name>
          <email>thomas.pellegrini@irit.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentin Barrière</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Toulouse</institution>
          ,
          <addr-line>IRIT, Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we describe the IRIT approach used for the MediaEval 2015 "Emotion in Music" task. The goal was to predict two real-valued emotion dimensions, namely valence and arousal, in a time-continuous fashion. We chose to use recurrent neural networks (RNN) for their sequence modeling capabilities. Hyperparameter tuning was performed through a 10-fold cross-validation setup on the 431 songs of the development subset. With the baseline set of 260 acoustic features, our best system achieved averaged root mean squared errors of 0.250 and 0.238, and Pearson's correlation coefficients of 0.703 and 0.692, for valence and arousal, respectively. These results were obtained by first making predictions with an RNN comprised of only 10 hidden units, smoothing them with a moving average filter, and using them as input to a second RNN to generate the final predictions. This system gave our best results on the official test data subset for arousal (RMSE=0.247, r=0.588), but not for valence. Valence predictions were much worse (RMSE=0.365, r=0.029). This may be explained by the fact that in the development subset, valence and arousal values were highly correlated (r=0.626), and this was not the case with the test data. Finally, slight improvements over these figures were obtained by adding spectral flatness and spectral valley features to the baseline set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Music Emotion Recognition (MER) is still a hot topic in Music
Information Retrieval. In [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the authors list four main issues
that make MER a challenging and very interesting
scientific task: 1) the ambiguity and granularity of emotion
description, 2) the heavy cognitive load of emotion annotation, 3)
the subjectivity of emotional perception, and 4) the semantic gap
between low-level acoustic features and high-level human
perception. MER consists either of labeling songs and music pieces
as a whole, which is a classification task, or of estimating
emotion dimensions in continuous time and space, which is
a regression task applied to time series. The latter
is the objective of the current challenge. For a complete
description of the task and corpus involved in the challenge, the
reader may refer to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        For continuous-space MER, many machine learning (ML)
techniques have been reported in the literature [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In the
MediaEval 2014 challenge edition, a variety of techniques were
used: simple and multi-level linear regression models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
support vector machines for regression (SVR) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
conditional random fields [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and Long Short-Term Memory
Recurrent Neural Networks (LSTM-RNN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This last
approach achieved the best results. Following
these results, we chose to use RNNs. All the ML models were
developed using the Theano toolbox [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>In order to tune and test prediction models, we ran
10-fold cross-validation (CV) experiments on the development
data subset. Once the best model was selected and tuned
within this setup, a single model was trained on the whole
development subset and used to generate predictions on the
official evaluation data subset.</p>
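      <p>For illustration, the following sketch shows how such a 10-fold split over the 431 development songs can be set up (our own minimal numpy example, not the code actually used; the per-fold training step is omitted):</p>
      <preformat>
import numpy as np

def ten_fold_splits(n_songs, n_folds=10, seed=0):
    """Randomly partition song indices into n_folds train/validation splits."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_songs)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, valid_idx

# Example: 431 development songs; one model is trained per fold and the
# validation RMSE and Pearson's r are averaged over the 10 folds.
for train_idx, valid_idx in ten_fold_splits(431):
    pass
</preformat>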
      <p>The input data were zero-mean and unit-variance
normalized. Standard PCA, PCA with a Gaussian kernel and
denoising autoencoders with Gaussian noise were tested to
further process the data, but no improvement was achieved
with any of these techniques.</p>
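      <p>A minimal sketch of the normalization step, assuming the per-dimension statistics are estimated on the training folds only (our own illustration, not the code actually used):</p>
      <preformat>
import numpy as np

def standardize(train_feats, test_feats, eps=1e-8):
    """Zero-mean, unit-variance normalization using training-set statistics."""
    mean = train_feats.mean(axis=0)
    std = train_feats.std(axis=0)
    train_norm = (train_feats - mean) / (std + eps)
    test_norm = (test_feats - mean) / (std + eps)
    return train_norm, test_norm
</preformat>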
      <p>
        We chose to use recurrent neural networks (RNN) for their
time sequence modeling capabilities. We used the Elman [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
model type, in which recurrent connections feed the hidden
layer: the activations of the hidden layer at time t-1 are
stored and fed back to the same layer at time t, together
with the data input. A tanh activation function and a
softmax function were used for the hidden layer and the final
layer with two outputs (for arousal and valence),
respectively. The layer weights were trained with the standard
root-mean-squared error cost function. Weights were updated
after each forward pass on a single song via the momentum
update rule.
      </p>
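      <p>To make the recurrence explicit, the following numpy sketch illustrates such an Elman forward pass (our own illustration, not the Theano implementation used in the experiments; the linear readout is a simplification of the two-output layer described above):</p>
      <preformat>
import numpy as np

class ElmanRNN:
    """Minimal Elman network sketch: the hidden state at time t-1 is fed back
    to the hidden layer at time t together with the current input frame."""

    def __init__(self, n_in, n_hidden=10, n_out=2, seed=0):
        rng = np.random.RandomState(seed)
        scale = 0.1
        self.W_in = rng.uniform(-scale, scale, (n_in, n_hidden))
        self.W_rec = rng.uniform(-scale, scale, (n_hidden, n_hidden))
        self.W_out = rng.uniform(-scale, scale, (n_hidden, n_out))
        self.b_h = np.zeros(n_hidden)
        self.b_o = np.zeros(n_out)

    def forward(self, x_seq):
        """x_seq: (n_frames, n_in) feature sequence for one song.
        Returns an (n_frames, 2) array of valence/arousal predictions."""
        h = np.zeros(self.W_rec.shape[0])
        outputs = []
        for x_t in x_seq:
            h = np.tanh(x_t.dot(self.W_in) + h.dot(self.W_rec) + self.b_h)
            # Linear readout here for simplicity; the paper reports a softmax
            # on the two-unit output layer.
            outputs.append(h.dot(self.W_out) + self.b_o)
        return np.array(outputs)
</preformat>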
      <p>The hyperparameters were tuned with the 10-fold CV
setup. The best model was comprised of 10 hidden units,
trained with a 1.0 × 10<sup>-3</sup> learning rate and a 1.0 × 10<sup>-2</sup>
regularization coefficient with both L1 and L2 norms. To further
limit overfitting, an early stopping strategy was used: the
models were all trained for 50 iterations only. This number
of iterations was set empirically.</p>
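      <p>For illustration, one such regularized momentum update for a single weight matrix could look as follows (a hedged sketch: the momentum coefficient mu is our assumption, as only the learning rate and the regularization coefficient are specified above):</p>
      <preformat>
import numpy as np

def momentum_step(w, grad, velocity, lr=1.0e-3, mu=0.9, l1=1.0e-2, l2=1.0e-2):
    """One momentum update for a weight matrix, with L1 and L2 penalties
    added to the gradient of the cost."""
    grad = grad + l1 * np.sign(w) + l2 * w
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
</preformat>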
      <p>A moving average filter was used to smooth the
predictions. Its size was tuned in the 10-fold CV setup, and the
best one was a window of 13 points. To avoid unwanted
border effects, the first and last 6 points, corresponding to
the filter delay, were set equal to the unfiltered predictions.</p>
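      <p>A minimal sketch of this smoothing step, with the 13-point window and the border handling described above (our own illustration):</p>
      <preformat>
import numpy as np

def smooth(pred, win=13):
    """Moving-average smoothing of a 1-D prediction sequence. The first and
    last (win - 1) / 2 points keep the unfiltered values, as described above."""
    half = (win - 1) // 2
    kernel = np.ones(win) / win
    smoothed = np.convolve(pred, kernel, mode="same")
    smoothed[:half] = pred[:half]
    smoothed[len(pred) - half:] = pred[len(pred) - half:]
    return smoothed
</preformat>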
      <p>Another post-processing step was tested: feeding
a second RNN with the predictions of the first RNN.
Looking at the output, one can see that the second RNN
further smoothed the predictions.</p>
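      <p>Put together, the resulting two-stage pipeline can be sketched as follows, reusing the ElmanRNN and smooth examples above (illustration only; in practice the second RNN is trained on the smoothed predictions of the first one):</p>
      <preformat>
# Assumes numpy (np) and the ElmanRNN / smooth sketches above are in scope.
rnn1 = ElmanRNN(n_in=260, n_hidden=10)   # first-pass predictor on the features
rnn2 = ElmanRNN(n_in=2, n_hidden=10)     # second RNN fed with the predictions

def predict(song_feats):
    first_pass = rnn1.forward(song_feats)                              # (n_frames, 2)
    smoothed = np.column_stack([smooth(first_pass[:, d]) for d in range(2)])
    return rnn2.forward(smoothed)                                      # final valence/arousal
</preformat>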
      <p>
        The challenge rules also allowed us to use our own acoustic
features. To complement the 260 baseline features, a set of
29 acoustic feature types was extracted with the
ESSENTIA toolbox [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a toolbox specifically designed for
Music Information Retrieval. The 29 feature types, some of which,
such as the Bark and ERB bands, span several frequency bands,
resulted in a total of 196 real values per audio frame. The
same frame rate as for the baseline features was used (0.5 s
window duration and hop size). We chose the feature types
from a large list, mainly from the spectral domain, such as
the so-called spectral "contrast", "valley" and "complexity", but
also a few features from the time domain, such as
"danceability". For a complete list and description of the available
feature extraction algorithms, the reader may refer to the
ESSENTIA API documentation Web page [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In order to select useful features, we tested each feature
type by adding it one at a time to the baseline feature
set. Only three feature types were found to improve the
baseline CV performance: two variants of spectral flatness
and a feature called "spectral valley". The two spectral
flatness features use two different frequency scales: the Bark
and the Equivalent Rectangular Bandwidth (ERB) scales.
25 Bark bands were used, as computed in ESSENTIA [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ].
The ERB scale consists of applying a frequency-domain
filterbank using gammatone filters [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Spectral flatness
provides a way to quantify how noise-like a sound is, as opposed
to being tone-like. Spectral valley is a feature derived from
the so-called spectral contrast feature, which represents the
relative spectral distribution [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This feature was shown
to perform better than Mel frequency cepstral coefficients in
the task of music type classification [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
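      <p>As a reference, the full-band spectral flatness can be written as the ratio of the geometric mean to the arithmetic mean of the spectrum; the band-wise variants apply the same ratio within Bark or ERB bands. A minimal numpy sketch, not the ESSENTIA implementation itself:</p>
      <preformat>
import numpy as np

def spectral_flatness(power_spectrum, eps=1e-10):
    """Spectral flatness: geometric mean / arithmetic mean of the spectrum.
    Values near 1 indicate a noise-like (flat) spectrum, values near 0 a
    tone-like (peaky) one."""
    p = np.asarray(power_spectrum) + eps
    geometric_mean = np.exp(np.mean(np.log(p)))
    arithmetic_mean = np.mean(p)
    return geometric_mean / arithmetic_mean
</preformat>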
    </sec>
    <sec id="sec-3">
      <title>3. RESULTS</title>
      <p>Results are shown in Table 1, for both the cross-validation
experiments and the runs on the official evaluation test data
subset, referred to as 'CV' and 'Eval', respectively. The
results are reported in terms of root-mean-squared error
(RMSE) and Pearson's correlation coefficient (r).</p>
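      <p>For reference, the two metrics can be computed per prediction sequence as follows (a minimal numpy sketch):</p>
      <preformat>
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between annotations and predictions."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def pearson_r(y_true, y_pred):
    """Pearson's correlation coefficient between annotations and predictions."""
    return np.corrcoef(np.asarray(y_true), np.asarray(y_pred))[0, 1]
</preformat>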
      <p>
        Generally speaking, valence predictions were less accurate
than the arousal ones, unlike the performance results
reported in the 2014 edition, as reported in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], for
example. Concerning the CV results, the simple linear
regression model (lr) was outperformed by the RNN model with
the baseline 260 features, with 0.275 and 0.261 RMSE
values for valence, and 0.254 and 0.246 for arousal, respectively.
Since the number of runs was limited, we did not submit
predictions with lr on Eval. As expected, this shows that
the sequential modeling capabilities of the RNN are useful
for this task. Adding the extra 8 features brought a slight
improvement (rnn, 268 feat.). Smoothing the network
predictions brought further improvement, using either 260 or
268 features as input. Finally, using the predictions as
input to a second RNN brought a slight improvement too. The
best system achieved averaged RMSE of 0.250 and 0.238,
and Pearson's correlation coefficients of 0.703 and 0.692, for
valence and arousal, respectively.
      </p>
      <p>
        Concerning the Eval results, this system also gave our
best results on the official test data subset, but only for arousal
(RMSE=0.247, r=0.588). Valence predictions were much
worse (RMSE=0.365, r=0.029). This may be explained by
the fact that in the development subset, valence and arousal
values were highly correlated (r=0.626), and this was not the
case with the test data, as hypothesized by the challenge
organizers. This performance discrepancy was also observed
by the organizers, who provided baseline results ('BSL')
using a linear regression model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Only our best three arousal
predictions outperformed the BSL results significantly.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSIONS</title>
      <p>In this paper, we described our experiments using RNNs
for the 2015 MediaEval Emotion in Music task. As expected,
the sequence modeling capabilities proved useful for this
task, since basic linear regression models were outperformed
in our cross-validation experiments. Prediction smoothing
also proved useful. The best results were obtained when
feeding smoothed predictions into a second RNN, for both
valence and arousal in our CV experiments, but only for
arousal on the official test set. The observed performance
discrepancy between the valence and arousal variables may
be due to the differences between the development and test
data: valence and arousal values were highly correlated
in the development dataset, and much less so in the test data
set. Concerning the acoustic feature set, slight
improvements were obtained by adding spectral flatness and spectral
valley features to the baseline feature set. As future work,
we plan to further explore denoising autoencoders and LSTM-RNNs,
since our first experiments with these models did not show
improvement compared to basic RNNs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>The Bark frequency scale</article-title>
          . http://ccrma.stanford.edu/~jos/bbt/Bark_Frequency_Scale.html. Accessed: 2015-08-24.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>ESSENTIA algorithm documentation web page</article-title>
          . http://essentia.upf.edu/documentation/algorithms_reference.html. Accessed: 2015-08-24.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>The ESSENTIA Bark bands documentation page</article-title>
          . http://essentia.upf.edu/documentation/reference/std_BarkBands.html. Accessed: 2015-08-24.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music task at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Breuleux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bastien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lamblin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Desjardins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Turian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Theano: a CPU and GPU math expression compiler</article-title>
          .
          <source>In Proc. of the Python for scientific computing conference (SciPy)</source>
          , volume
          <volume>4</volume>
          , page 3, Austin,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mayor</surname>
          </string-name>
          , and et al.
          <article-title>ESSENTIA: an Audio Analysis Library for Music Information Retrieval</article-title>
          .
          <source>In Proc. International Society for Music Information Retrieval Conference (ISMIR'13)</source>
          , pages
          <fpage>493</fpage>
          -
          <lpage>498</lpage>
          , Curitiba
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Coutinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Scherer</surname>
          </string-name>
          .
          <article-title>The Munich LSTM-RNN Approach to the MediaEval 2014 "Emotion in Music" Task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Elman</surname>
          </string-name>
          .
          <article-title>Finding structure in time</article-title>
          .
          <source>Cognitive Science</source>
          ,
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Imbrasaite</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Robinson</surname>
          </string-name>
          .
          <article-title>Music emotion tracking with continuous conditional neural fields and relative representation</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Tao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.-H.</given-names>
            <surname>Cai</surname>
          </string-name>
          .
          <article-title>Music type classification by spectral contrast feature</article-title>
          .
          <source>In Proc. ICME</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>113</fpage>
          -
          <lpage>116</lpage>
          , Lausanne
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Migneco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Speck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Turnbull</surname>
          </string-name>
          .
          <article-title>Music emotion recognition: a state of the art review</article-title>
          .
          <source>In 11th International Society for Music Information Retrieval Conference (ISMIR)</source>
          , Utrecht,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Van Segbroeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Affective feature design and predicting continuous affective dimensions from music</article-title>
          .
          <source>In MediaEval Workshop</source>
          , Barcelona,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Moore</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Glasberg</surname>
          </string-name>
          .
          <article-title>Suggested formulae for calculating auditory-filter bandwidths and excitation patterns</article-title>
          .
          <source>Journal of the Acoustical Society of America</source>
          ,
          <volume>74</volume>
          (
          <issue>3</issue>
          ):
          <fpage>750</fpage>
          -
          <lpage>753</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Horner</surname>
          </string-name>
          .
          <article-title>Beatsens' Solution for MediaEval 2014 Emotion in Music Task</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Music emotion recognition</article-title>
          . CRC Press,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>