<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-scale Approaches to the MediaEval 2015 “Emotion in Music” Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mingxing Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinxing Li</string-name>
          <email>lixinxing1991@126.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haishu Xianyu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiashen Tian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fanhang Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenxiao Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Key Laboratory of Pervasive Computing, Ministry of Education; Tsinghua National Laboratory for Information Science and Technology (TNList); Department of Computer Science and Technology, Tsinghua University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The goal of the “Emotion in Music” task in MediaEval 2015 is to automatically estimate the emotions expressed by music (in terms of Arousal and Valence) in a time-continuous fashion. In this paper, considering the high context correlation within the music feature sequence, we study several multi-scale approaches at different levels, including acoustic feature learning with Deep Belief Networks (DBNs) followed by a modified Autoencoder (AE), multi-scale regression fusion of bi-directional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) with an Extreme Learning Machine (ELM), and hierarchical prediction with Support Vector Regression (SVR). The evaluation performances of all submitted runs are significantly better than the baseline provided by the organizers, illustrating the effectiveness of the proposed approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2015 “Emotion in Music” task has only one
subtask, dynamic emotion characterization, including two
required runs (one for feature extraction with linear
regression, the other for a regression model with the baseline feature
set provided by the organizers) and up to three other runs
(any combination of features and machine learning
techniques), to permit a thorough comparison between different
methods. In the task this year, the development data
contains 431 clips with the best annotation agreement, selected
from last year's data, and the evaluation data consists of
58 full-length songs. For more details, please refer to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In order to predict and trace the evolution of music
emotion more precisely, we investigated several multi-scale
methods implemented at three different levels: the acoustic
feature level, the regression model level and the emotion annotation
level. At the acoustic feature level, features were organized
into groups according to their time scales and fundamentals,
and a deep learning algorithm was used to learn new features
that integrate multi-scale information about music emotion.
Inspired by BLSTM-RNNs' capability of mapping sequence to
sequence [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we trained several BLSTM-RNNs on sequences of different
lengths and fused them using an extreme
learning machine to produce the final prediction. In addition, we
proposed a hierarchical regression scheme to predict the global trend
and local fluctuation of dynamic music emotion separately.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
    <sec id="sec-3">
      <title>2.1 Feature Learning</title>
      <p>
        We used the openSMILE toolbox to extract 65 Low-Level
Descriptors (LLDs) with the configuration IS13_ComParE_lld (see
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for details) and divided them into 3 groups as follows: A)
26 LLDs related to audSpec; B) 29 LLDs related to
pcm_fftMag and the Mel-Frequency Cepstral Coefficients (MFCCs); C)
10 LLDs related to voice. In addition, we adopted the idea
proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to extract Compressibility (comp), Spectral
Centre of Mass (SCOM) and Median Spectral Band Energy
(MSBE) at the local scale, and used the MIR Toolbox [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to
extract 20 other features related to music attributes, including
dynamic RMS energy, tempo, event density, spectral
centroid, flatness, irregularity, skewness, kurtosis, rolloff85,
rolloff95, spread, brightness, roughness, entropy, spectral
flux, zero-crossing rate, HCDF, key mode, key clarity and
chromagram centroid, and then assembled these 23 features as group
D. The frame size was 60 ms for group C and 25 ms for the other
groups. In all groups, overlapping windows were used with
a 10 ms step.
      </p>
      <p>For the features of each group, in a 1 s window with 0.5 s overlap,
we calculated the mean, standard deviation (STD), slope and Shannon entropy
functionals of the LLDs; the delta coefficients together with their STD and
slope functionals; and the acceleration coefficients together with
their STD functionals. This resulted in 4 feature sets with
dimensions of 182, 203, 70 and 161, respectively (7 functionals per LLD).</p>
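      <p>As a concrete illustration of this functional extraction, the following is a minimal sketch in Python (our own reconstruction, not the authors' code: the NumPy-based least-squares slope and histogram entropy are assumptions, while the window size, hop and functional set follow the text; note that 4 + 2 + 1 = 7 functionals per LLD reproduces the reported dimensions, e.g. 26 x 7 = 182 for group A):</p>
      <preformat>
import numpy as np

def shannon_entropy(x, bins=20):
    """Histogram-based Shannon entropy of one windowed LLD track."""
    p, _ = np.histogram(x, bins=bins)
    p = p / max(p.sum(), 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def slope(x):
    """Least-squares slope of the LLD values against frame index."""
    return np.polyfit(np.arange(len(x)), x, 1)[0]

def window_functionals(lld, frame_rate=100, win_s=1.0, hop_s=0.5):
    """lld: (n_frames, n_llds) matrix of low-level descriptors at a
    10 ms step. Returns one functional vector per 1 s window, 0.5 s hop."""
    win, hop = int(win_s * frame_rate), int(hop_s * frame_rate)
    delta = np.diff(lld, axis=0, prepend=lld[:1])       # delta coefficients
    accel = np.diff(delta, axis=0, prepend=delta[:1])   # acceleration coefficients
    feats = []
    for start in range(0, lld.shape[0] - win + 1, hop):
        w, dw, aw = (m[start:start + win] for m in (lld, delta, accel))
        row = []
        for col in w.T:   # mean, STD, slope, entropy of the raw LLDs
            row += [col.mean(), col.std(), slope(col), shannon_entropy(col)]
        for col in dw.T:  # STD and slope of the deltas
            row += [col.std(), slope(col)]
        for col in aw.T:  # STD of the accelerations
            row += [col.std()]
        feats.append(row)
    return np.asarray(feats)
      </preformat>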
      <p>
        Four different Deep Belief Networks (DBNs) were used to
learn a higher-level representation for each feature group
independently; these representations were then fused by a special Autoencoder
with a modified cost function considering sparseness and
heterogeneity entropy (details described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), to produce the
final features at a rate of 2 Hz for the succeeding regression.
      </p>
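      <p>The exact cost function is given in [
      <xref ref-type="bibr" rid="ref11">11</xref>
      ]; as a schematic sketch only, the fusion step could look like the following (PyTorch assumed; the code dimension is a placeholder, the group dimensions stand in for the DBN output sizes, and a plain L1 sparsity penalty substitutes for the modified sparseness/heterogeneity-entropy term):</p>
      <preformat>
import torch
import torch.nn as nn

class FusionAE(nn.Module):
    """Schematic fusion autoencoder: concatenates the four group
    representations and learns a joint code at 2 Hz."""
    def __init__(self, group_dims=(182, 203, 70, 161), code_dim=100):
        super().__init__()
        d = sum(group_dims)
        self.encoder = nn.Sequential(nn.Linear(d, code_dim), nn.Sigmoid())
        self.decoder = nn.Linear(code_dim, d)

    def forward(self, groups):
        x = torch.cat(groups, dim=-1)
        code = self.encoder(x)
        return code, self.decoder(code), x

def loss_fn(code, x_hat, x, sparsity_weight=1e-3):
    # Reconstruction error plus an L1 sparsity stand-in; the actual
    # sparseness / heterogeneity-entropy cost is defined in [11].
    return nn.functional.mse_loss(x_hat, x) + sparsity_weight * code.abs().mean()

# usage on a batch of 8 frames:
model = FusionAE()
groups = [torch.randn(8, d) for d in (182, 203, 70, 161)]
code, x_hat, x = model(groups)
loss = loss_fn(code, x_hat, x)
      </preformat>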
    </sec>
    <sec id="sec-4">
      <title>2.2 Multi-scale BLSTM-RNNs Fusion</title>
      <sec id="sec-4-1">
        <title>2.2.1 Models training</title>
        <p>
          Considering the high context correlation within the
music emotion feature sequence, we used bi-directional Long
Short-Term Memory recurrent neural networks
(BLSTM-RNNs), which have worked quite well on numerous
sequence-modeling tasks in recent years [
          <xref ref-type="bibr" rid="ref10 ref2 ref7 ref8">10, 7, 2, 8</xref>
          ], to predict
dynamic music emotion.
        </p>
        <p>Separate BLSTM-RNNs were trained for arousal and
valence regression. BLSTM-RNNs with 5 hidden layers (250
units per layer and direction) were used. The first two
layers were pre-trained with the whole development set (431 clips)
and the test set (58 songs). Training with a learning rate of 5e-6
was stopped after a maximum of 100 iterations, or after 20
iterations without improvement of the validation set error. To
alleviate overfitting, Gaussian noise with zero mean and a
standard deviation of 0.6 was added to the input activations, and
sequences were presented in random order during training.
All BLSTM-RNNs were trained with CURRENNT
(https://sourceforge.net/p/currennt).</p>
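        <p>The authors trained with CURRENNT; purely as a framework-independent illustration of the same architecture and hyper-parameters, a hedged PyTorch equivalent might look as follows (the input feature dimension and the optimizer choice are assumptions; the early-stopping logic with the 100-iteration cap and 20-iteration patience would live in the training loop, which is omitted):</p>
        <preformat>
import torch
import torch.nn as nn

class BLSTMRegressor(nn.Module):
    """5 bidirectional LSTM layers, 250 units per layer and direction,
    mapping a feature sequence to per-frame arousal (or valence)."""
    def __init__(self, n_features, hidden=250, layers=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, x, noise_std=0.6):
        if self.training:  # zero-mean Gaussian input noise against overfitting
            x = x + noise_std * torch.randn_like(x)
        h, _ = self.lstm(x)                 # (batch, time, 2*hidden)
        return self.out(h).squeeze(-1)      # (batch, time)

model = BLSTMRegressor(n_features=260)      # feature dimension is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=5e-6)
        </preformat>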
        <p>We trained 4 kinds of BLSTM-RNNs with different time
scales (i.e. sequence lengths) of 60, 30, 20 and 10, respectively,
on a training set containing 411 clips, and validated them
on the remaining 20 clips, selected randomly according to
the genre distribution of the test data (i.e. the 58 complete
songs). In total, we made 5 different data partitions (411 clips
for training, 20 clips for validation) and, for each, computed 3 trials
of the same model with randomized initial weights,
among which the best one was selected. Hence, there were
5 different BLSTM-RNNs for each time scale.</p>
      </sec>
      <sec id="sec-4-2">
        <title>2.2.2 Model selection and fusion</title>
        <p>In order to select 4 models with different time scales for
fusion, we applied two different criteria separately to compose
two groups of 4 models each. The first criterion was RMSE-first,
which simply selected the model with the best RMSE for each
time scale, while the second criterion considered both
the RMSE and the data partition, to guarantee that the training
sets of the selected models were different from each
other. In our experiments, 2 models were shared by the
two groups; in other words, there were 6 unique models for
fusion.</p>
        <p>
          At the fusion step, we averaged the predictions produced
by all 6 models as the final result. In addition to this
simple fusion policy, we trained an Extreme Learning Machine
(ELM) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for fusion. The input feature vector of the ELM
consisted of the original predictions of the 4 different time-scale
BLSTM-RNNs, their delta derivatives and the smoothed
values generated with a triangle filter of length 50.
Two separate ELMs were constructed to fuse the
corresponding predictions of the two model groups mentioned
above. Finally, the outputs of the two ELMs were averaged to
produce the final emotion prediction.
        </p>
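        <p>A minimal sketch of this fusion step (our reconstruction: the hidden-layer size, tanh activation and pseudo-inverse solution are standard ELM choices from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while the triangle filter and feature layout follow the text):</p>
        <preformat>
import numpy as np

def elm_fusion_features(preds, filt_len=50):
    """preds: (T, 4) per-frame predictions of the 4 time-scale BLSTM-RNNs.
    Returns the originals, their deltas and triangle-filter-smoothed values."""
    delta = np.diff(preds, axis=0, prepend=preds[:1])
    tri = np.bartlett(filt_len)              # triangle filter of length 50
    tri /= tri.sum()
    smooth = np.column_stack([np.convolve(preds[:, j], tri, mode="same")
                              for j in range(preds.shape[1])])
    return np.hstack([preds, delta, smooth])  # (T, 12)

class ELM:
    """Basic single-hidden-layer ELM regressor: random input weights,
    closed-form least-squares output weights."""
    def __init__(self, n_hidden=200, seed=0):  # hidden size is an assumption
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)
        self.beta = np.linalg.pinv(H) @ y      # least-squares output weights
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
        </preformat>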
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Hierarchical Regression</title>
      <p>
        The aim of hierarchical regression is to predict the global
trend and the local fluctuation of dynamic music emotion
separately. First, a global Support Vector Regression (SVR) model
was built to predict the mean of the dynamic emotion attributes
of a whole song, using 6373 song-level global features extracted
with the openSMILE toolbox and the IS13_ComParE configuration
(see [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for details). Then, openSMILE with the configuration
IS13_ComParE_lld was used to extract 130 segment-level
features, whose means and standard deviations were
calculated with a 1 s window and 0.5 s shift to form the local features
used by a local SVR to predict the fluctuation of the dynamic emotion
attributes for each 0.5 s clip. Finally, each fluctuation value
predicted by the local SVR and the mean value predicted by
the global SVR were added to form the final emotion
prediction for the corresponding 0.5 s clip.
      </p>
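      <p>A minimal sketch of this two-level scheme (our own illustration with scikit-learn and placeholder random data; the feature dimensions follow the text, 6373 song-level features and 2 x 130 = 260 segment-level mean/STD features, but the kernel and hyper-parameters are unspecified assumptions):</p>
      <preformat>
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Placeholder data shaped as the text describes.
X_global = rng.standard_normal((431, 6373))   # song-level global features
y_mean = rng.standard_normal(431)             # per-song mean of the attribute
X_local = rng.standard_normal((5000, 260))    # segment-level mean/STD features
y_fluct = rng.standard_normal(5000)           # annotation minus song mean

global_svr = SVR().fit(X_global, y_mean)      # predicts the global trend
local_svr = SVR().fit(X_local, y_fluct)       # predicts the local fluctuation

def predict_song(x_song_global, x_song_segments):
    """Final prediction per 0.5 s clip = predicted song mean
    plus the predicted local fluctuation."""
    mean = global_svr.predict(x_song_global.reshape(1, -1))[0]
    return mean + local_svr.predict(x_song_segments)
      </preformat>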
    </sec>
    </sec>
    <sec id="sec-6">
      <title>3. RUNS AND EVALUATION RESULTS</title>
      <p>We submitted four runs for the task this year. The specifics
of each run are as follows: Run 1) Multi-scale BLSTM-RNNs
based fusion with the simple average policy, performed
with the baseline feature set. Run 2) Same as Run 1, but
using ELMs for fusion. Run 3) Same as Run 1, but using
the new features learnt with the method described in Section
2.1. In all the above runs, the test data was segmented into
fixed-length clips with 50% overlap, according to the time scale
of the BLSTM-RNNs specified. Run 4) The SVR based
hierarchical regression described in Section 2.3.</p>
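      <p>As a small illustration of that segmentation step (a sketch; the 2 Hz feature rate comes from Section 2.1, everything else is generic):</p>
      <preformat>
def segment(seq, length, overlap=0.5):
    """Split a song-level feature sequence into fixed-length clips
    with the given fractional overlap (50% in Runs 1-3)."""
    step = max(1, int(length * (1 - overlap)))
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]

# e.g. a 200-step song (100 s at 2 Hz) cut for the length-60 BLSTM-RNN:
clips = segment(list(range(200)), length=60)   # 5 clips of 60 steps each
      </preformat>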
      <p>In Table 1, we report the official evaluation metrics (r,
the Pearson correlation coefficient, and RMSE, the Root Mean
Squared Error). The results show that all runs were
significantly better than the baseline result. Considering the
overall performance, we observed that Run 2 was
the best one. However, Run 2 was not consistently better than Run 1,
which indicates that the ELMs might have been trained
insufficiently. Both Runs 1 and 2 worked particularly well
in r, which we attribute to the BLSTM-RNNs'
capability of mapping sequence to sequence. The reason why the
new features in Run 3 did not bring the expected
improvement might be that the low-level features were not appropriate
for representing different time scales. Although the method in
Run 4 is simple, it delivered RMSE and r for
Arousal comparable to the other runs, and performed quite well for
Valence, but only in RMSE, not in r, which may be related to
the decomposition into global trend and local fluctuation. We
believe it is a promising algorithm.</p>
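      <p>For reference, the two official metrics can be computed per song as follows (a straightforward sketch of the standard definitions):</p>
      <preformat>
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between annotation and prediction."""
    return np.corrcoef(y_true, y_pred)[0, 1]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    return np.sqrt(np.mean(err ** 2))
      </preformat>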
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSIONS</title>
      <p>We have described the THU-HCSIL team's approaches to the
Emotion in Music task at MediaEval 2015. Several multi-scale
approaches at three levels have been compared with the
baseline system, including acoustic feature learning,
multi-scale regression fusion and hierarchical emotion
prediction. The results show that the proposed methods are
significantly better than the baseline system, illustrating the
effectiveness of the multi-scale approaches. In future work,
we plan to investigate how to select the time scale
automatically and systematically. In addition, using the audio files of
the test data in the pre-training stage of the submitted Runs
1-3 may limit the generalizability of the trained models, and
some more evaluations are needed.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work was partially supported by the National
Natural Science Foundation of China (No. 61171116, 61433018) and
the National High Technology Research and Development
Program of China (863 Program) (No. 2015AA016305).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music task at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y. C.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Xie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. K.</given-names>
            <surname>Soong</surname>
          </string-name>
          .
          <article-title>TTS synthesis with bidirectional LSTM based recurrent neural networks</article-title>
          .
          <source>In The 15th Annual Conference of the International Speech Communication Association (INTERSPEECH)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Framewise phoneme classification with bidirectional LSTM and other neural network architectures</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>18</volume>
          (
          <issue>5-6</issue>
          ):
          <fpage>602</fpage>
          -
          <lpage>610</lpage>
          ,
          <year>June 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Siew</surname>
          </string-name>
          .
          <article-title>Extreme learning machine: theory and applications</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>70</volume>
          (
          <issue>1</issue>
          ):
          <fpage>489</fpage>
          -
          <lpage>501</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Guha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Vaz</surname>
          </string-name>
          .
          <article-title>Affective feature design and predicting continuous affective dimensions from music</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          ,
          <year>October 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Lartillot</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Toiviainen</surname>
          </string-name>
          .
          <article-title>A Matlab toolbox for music feature extraction from audio</article-title>
          .
          <source>In International Conference on Digital Audio Effects</source>
          , pages
          <fpage>237</fpage>
          -
          <lpage>244</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Beaufays</surname>
          </string-name>
          .
          <article-title>Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition</article-title>
          .
          <source>arXiv preprint arXiv:1402.1128</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          .
          <article-title>Voice conversion using deep bidirectional long short-term memory based recurrent neural networks</article-title>
          .
          <source>In International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</source>
          , pages
          <fpage>4869</fpage>
          -
          <lpage>4873</lpage>
          ,
          <year>April 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mortillaro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Scherer</surname>
          </string-name>
          .
          <article-title>On the acoustics of emotion in audio: What speech, music and sound have in common</article-title>
          .
          <source>Frontiers in Psychology</source>
          ,
          <volume>4</volume>
          (Article ID 292):
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>May 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wöllmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Rigoll</surname>
          </string-name>
          .
          <article-title>Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise</article-title>
          .
          <source>In International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Xianyu</surname>
          </string-name>
          .
          <article-title>Heterogeneity-entropy based unsupervised feature learning for personality prediction with cross-media data</article-title>
          . Submitted to
          <source>The Thirtieth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>