              Multi-scale Approaches to the MediaEval 2015
                         “Emotion in Music” Task

     Mingxing Xu, Xinxing Li, Haishu Xianyu, Jiashen Tian, Fanhang Meng, Wenxiao Chen
                          Key Laboratory of Pervasive Computing, Ministry of Education
                   Tsinghua National Laboratory for Information Science and Technology (TNList)
                Department of Computer Science and Technology, Tsinghua University, Beijing, China
                       xumx@tsinghua.edu.cn, {lixinxing1991, xyhs2010}@126.com

ABSTRACT
The goal of the “Emotion in Music” task in MediaEval 2015 is to
automatically estimate the emotions expressed by music (in terms of
Arousal and Valence) in a time-continuous fashion. In this paper,
considering the strong contextual correlation within music feature
sequences, we study several multi-scale approaches at different
levels: acoustic feature learning with Deep Belief Networks (DBNs)
followed by a modified Autoencoder (AE), multi-scale regression
fusion based on bi-directional Long Short-Term Memory Recurrent
Neural Networks (BLSTM-RNNs) and an Extreme Learning Machine (ELM),
and hierarchical prediction with Support Vector Regression (SVR).
All submitted runs perform significantly better than the baseline
provided by the organizers, illustrating the effectiveness of the
proposed approaches.
1. INTRODUCTION
   The MediaEval 2015 “Emotion in Music” benchmark comprises a single
task, dynamic emotion characterization, with two required runs (one
for feature extraction evaluated with linear regression, the other
for a regression model using the baseline feature set provided by
the organizers) and up to three additional runs (any combination of
features and machine learning techniques) to permit a thorough
comparison between different methods. This year, the development
data contains the 431 clips with the best annotation agreement
selected from last year's data, and the evaluation data consists of
58 full-length songs. For more details, please refer to [1].
   In order to predict and trace the evolution of music emotion more
precisely, we investigated several multi-scale methods implemented
at three different levels: the acoustic feature level, the
regression model level and the emotion annotation level. At the
acoustic feature level, features were organized into groups
according to their time scales and characteristics, and deep
learning was used to learn new features that integrate multi-scale
information about music emotion. Inspired by the capability of
BLSTM-RNNs to map sequences to sequences [3], we trained BLSTM-RNNs
on sequences of different lengths and fused them with an extreme
learning machine to produce the final prediction. In addition, we
propose a hierarchical regression that predicts the global trend and
the local fluctuation of dynamic music emotion separately.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany
2. METHODOLOGY

2.1 Feature Learning
   We used the openSMILE toolbox to extract 65 Low-Level Descriptors
(LLDs) with the IS13_ComParE_lld configuration (see [9] for details)
and divided them into three groups: A) 26 LLDs related to audSpec;
B) 29 LLDs related to pcm_fftMag and Mel-Frequency Cepstral
Coefficients (MFCCs); C) 10 voice-related LLDs. In addition, we
adopted the idea proposed in [5] to extract Compressibility (comp),
Spectral Centre of Mass (SCOM) and Median Spectral Band Energy
(MSBE) at the local scale, and used the MIR Toolbox [6] to extract
20 further features related to musical attributes, including dynamic
RMS energy, Tempo, Event Density, Spectrum centroid, Flatness,
Irregularity, Skewness, Kurtosis, Rolloff85, Rolloff95, Spread,
Brightness, Roughness, Entropy, Spectral Flux, Zero crossing rate,
HCDF, Key mode, Key clarity and Chromagram centroid; these were
assembled as group D. The frame size was 60 ms for group C and 25 ms
for the other groups. In all groups, overlapping windows were used
with a 10 ms step.
   For the features of each group, over 1 s windows with 0.5 s
overlap we calculated the mean, standard deviation (STD), slope and
Shannon entropy functionals; the delta coefficients with their STD
and slope functionals; and the acceleration coefficients with their
STD functional. This resulted in four feature sets of dimension 182,
203, 70 and 161, respectively.
   Four different Deep Belief Networks (DBNs) were used to learn a
higher-level representation for each feature group independently.
These representations were then fused by a special Autoencoder with
a modified cost function considering sparsity and heterogeneity
entropy (details described in [11]) to produce the final features at
a rate of 2 Hz for the subsequent regression.
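   For illustration, the windowed functional computation can be
sketched as follows (this is not the toolchain actually used, which
combined openSMILE and the MIR Toolbox). The sketch assumes the LLDs
of one group are already available as a NumPy array sampled at the
10 ms step, and covers only the static functionals (mean, STD, slope
and Shannon entropy), omitting the delta and acceleration parts; the
entropy normalisation and all names are illustrative choices.

    import numpy as np

    FRAME_STEP = 0.010   # LLD hop size: 10 ms, as described above
    WIN_LEN    = 1.0     # functional window of 1 s
    WIN_SHIFT  = 0.5     # 0.5 s shift (i.e. 50% overlap)

    def static_functionals(window):
        """Mean, STD, slope and Shannon entropy per LLD column of one window."""
        t = np.arange(window.shape[0])
        mean = window.mean(axis=0)
        std = window.std(axis=0)
        slope = np.polyfit(t, window, 1)[0]        # least-squares slope per LLD
        p = np.abs(window) + 1e-12                 # one possible normalisation
        p /= p.sum(axis=0)
        entropy = -(p * np.log2(p)).sum(axis=0)
        return np.concatenate([mean, std, slope, entropy])

    def windowed_functionals(llds):
        """llds: (n_frames, n_llds) at a 10 ms step -> one feature row per 0.5 s."""
        win = int(WIN_LEN / FRAME_STEP)
        hop = int(WIN_SHIFT / FRAME_STEP)
        rows = [static_functionals(llds[s:s + win])
                for s in range(0, llds.shape[0] - win + 1, hop)]
        return np.vstack(rows)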

2.2 Multi-scale BLSTM-RNNs Fusion

2.2.1 Model training
   Considering the strong contextual correlation within music emotion
feature sequences, we used bi-directional Long Short-Term Memory
recurrent neural networks (BLSTM-RNNs), which have worked well on
numerous sequence modeling tasks in recent years [10, 7, 2, 8], to
predict dynamic music emotion.
   Separate BLSTM-RNNs were trained for arousal and valence
regression. Networks with 5 hidden layers (250 units per layer and
direction) were used. The first two layers were pre-trained on the
whole development set (431 clips) and the test set (58 songs).
Training with a learning rate of 5E-6 was stopped after a maximum of
100 iterations, or after 20 iterations without improvement of the
validation set error. To alleviate over-fitting, Gaussian noise with
zero mean and standard deviation 0.6 was added to the input
activations, and sequences were presented in random order during
training. All BLSTM-RNNs were trained with CURRENNT
(https://sourceforge.net/p/currennt).
   We trained 4 kinds of BLSTM-RNNs with different time scales
(i.e., sequence lengths) of 60, 30, 20 and 10, respectively, on a
training set of 411 clips, and validated them on the remaining 20
clips, selected randomly according to the genre distribution of the
test data (i.e., the 58 complete songs). In total we created 5
different data partitions (411 clips for training, 20 for
validation) and ran 3 trials of each model with randomized initial
weights, among which the best one was selected. Hence, there were 5
different BLSTM-RNNs for each time scale.
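   The regressors were trained with CURRENNT; purely for
illustration, a roughly equivalent configuration can be sketched in
PyTorch as below (5 bidirectional LSTM layers of 250 units per
direction, one output per frame, and zero-mean Gaussian noise with
standard deviation 0.6 added to the input activations during
training). The optimizer choice and the input dimensionality are
assumptions, not taken from the original setup.

    import torch
    import torch.nn as nn

    class BLSTMRegressor(nn.Module):
        """Frame-level arousal or valence regressor: 5 BLSTM layers, 250 units/direction."""

        def __init__(self, n_features, noise_std=0.6):
            super().__init__()
            self.noise_std = noise_std
            self.blstm = nn.LSTM(n_features, 250, num_layers=5,
                                 bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * 250, 1)       # one emotion dimension per frame

        def forward(self, x):                      # x: (batch, seq_len, n_features)
            if self.training:                      # Gaussian input noise against over-fitting
                x = x + self.noise_std * torch.randn_like(x)
            h, _ = self.blstm(x)
            return self.out(h).squeeze(-1)         # (batch, seq_len)

    # One such model per time scale (sequence lengths of 60, 30, 20 and 10 frames).
    model = BLSTMRegressor(n_features=64)          # input size is illustrative
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-6)
    criterion = nn.MSELoss()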
2.2.2 Model selection and fusion
   In order to select 4 models with different time scales for fusion,
we applied two different criteria to compose two groups of 4 models.
The first criterion, RMSE-first, simply selected the model with the
best RMSE for each time scale; the second considered both the RMSE
and the data partition, so that the training sets of the models
selected for fusion differed from each other. In our experiments, 2
models were shared by the two groups; in other words, there were 6
unique models for fusion.
   At the fusion step, we averaged the predictions produced by all 6
models to obtain the final result. In addition to this simple fusion
policy, we trained an Extreme Learning Machine (ELM) [4] for fusion.
The input feature vector of the ELM consisted of the original
predictions of the 4 BLSTM-RNNs with different time scales, their
delta derivatives, and the smoothed values generated by a triangle
filter of length 50. Two separate ELMs were constructed to fuse the
corresponding predictions of the two model groups mentioned above.
Finally, the outputs of the two ELMs were averaged to produce the
final emotion prediction.
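   As a sketch of the ELM fusion step: an ELM is a single random
hidden layer whose output weights are solved in closed form by least
squares. The input below stacks the four time-scale predictions,
their deltas and the triangle-filter smoothed versions (length 50),
as described above; the hidden-layer size, activation and random
seed are illustrative assumptions, not the original settings.

    import numpy as np

    def triangle_smooth(pred, length=50):
        """Smooth one prediction sequence with a triangle filter of the given length."""
        w = np.bartlett(length)
        return np.convolve(pred, w / w.sum(), mode="same")

    def fusion_inputs(preds):
        """preds: (n_frames, 4) per-frame predictions of the 4 time-scale BLSTM-RNNs."""
        deltas = np.vstack([np.zeros((1, preds.shape[1])), np.diff(preds, axis=0)])
        smooth = np.column_stack([triangle_smooth(p) for p in preds.T])
        return np.hstack([preds, deltas, smooth])   # 12 values per frame

    def elm_fit(X, y, n_hidden=200, seed=0):
        """Random hidden layer + least-squares read-out (the essence of an ELM [4])."""
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((X.shape[1], n_hidden))
        b = rng.standard_normal(n_hidden)
        H = np.tanh(X @ W + b)
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)
        return W, b, beta

    def elm_predict(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta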
2.3 Hierarchical Regression
   The aim of hierarchical regression is to predict the global trend
and the local fluctuation of dynamic music emotion separately.
First, a global Support Vector Regression (SVR) model was built to
predict the mean of the dynamic emotion attributes of a whole song
from 6373 song-level global features extracted with openSMILE using
the IS13_ComParE configuration (see [9] for details). Then,
openSMILE with the IS13_ComParE_lld configuration was used to
extract 130 segment-level features, whose means and standard
deviations were computed over a 1 s window with a 0.5 s shift to
form local features, from which a local SVR predicted the
fluctuation of the dynamic emotion attributes for each 0.5 s clip.
Finally, the fluctuation value predicted by the local SVR and the
mean value predicted by the global SVR were added to form the final
emotion prediction for the corresponding 0.5 s clip.
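   A minimal sketch of this two-level scheme using scikit-learn SVRs
is given below. It assumes the song-level (6373-dimensional) and
segment-level (2 x 130-dimensional) feature matrices have already
been extracted with openSMILE; the function names, kernel and
hyper-parameters are illustrative, not those of the original system.

    from sklearn.svm import SVR

    def train_hierarchical(song_feats, song_means, seg_feats, seg_labels, seg_song_idx):
        """Global SVR for the per-song mean, local SVR for the per-segment fluctuation.

        song_feats:   (n_songs, 6373) song-level openSMILE features
        song_means:   (n_songs,) mean arousal or valence per song
        seg_feats:    (n_segments, 260) segment-level mean+STD features
        seg_labels:   (n_segments,) dynamic annotation per 0.5 s segment
        seg_song_idx: (n_segments,) index of the song each segment belongs to
        """
        global_svr = SVR(kernel="rbf").fit(song_feats, song_means)
        fluctuation = seg_labels - song_means[seg_song_idx]
        local_svr = SVR(kernel="rbf").fit(seg_feats, fluctuation)
        return global_svr, local_svr

    def predict_hierarchical(global_svr, local_svr, song_feat, seg_feats):
        """Final prediction = predicted song mean + predicted segment fluctuation."""
        mean = global_svr.predict(song_feat.reshape(1, -1))[0]
        return mean + local_svr.predict(seg_feats)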
3. RUNS AND EVALUATION RESULTS
   We submitted four runs for the task this year. The specifics of
each run are as follows. Run 1) Multi-scale BLSTM-RNN based fusion
with the simple averaging policy, performed on the baseline feature
set. Run 2) Same as Run 1, but using ELMs for fusion. Run 3) Same as
Run 1, but using the new features learnt with the method described
in Section 2.1. In all of the above runs, the test data was
segmented into fixed-length clips with 50% overlap, according to the
time scale of the corresponding BLSTM-RNNs. Run 4) The SVR-based
hierarchical regression described in Section 2.3.
   In Table 1, we report the official evaluation metrics (r, the
Pearson correlation coefficient, and RMSE, the root mean squared
error). All runs were significantly better than the baseline.
Considering the overall performance, Run 2 was the best; however, it
was not consistently better than Run 1, which indicates that the
ELMs might have been trained insufficiently. Both Run 1 and Run 2
worked particularly well in terms of r, which we attribute to the
BLSTM-RNNs' capability of mapping sequences to sequences. The reason
why the new features in Run 3 did not bring the expected improvement
might be that the low-level features were not appropriate for
representing the different time scales. Although the method of Run 4
is simple, it delivered RMSE and r for Arousal comparable to the
other runs, and performed quite well for Valence in RMSE, though not
in r, which may be related to the decomposition into global trend
and local fluctuation. We believe it is a promising algorithm.

        Table 1: Official evaluation results on test data.

    Dimension   System        RMSE               r
                Baseline   0.366 ± 0.18    0.01 ± 0.38
                 Run 1     0.331 ± 0.18    0.12 ± 0.54
     Valence     Run 2     0.308 ± 0.17    0.15 ± 0.47
                 Run 3     0.349 ± 0.19    0.02 ± 0.51
                 Run 4     0.303 ± 0.19    0.01 ± 0.40

                Baseline   0.270 ± 0.11    0.36 ± 0.26
                 Run 1     0.230 ± 0.11    0.66 ± 0.25
     Arousal     Run 2     0.234 ± 0.11    0.63 ± 0.27
                 Run 3     0.240 ± 0.12    0.52 ± 0.37
                 Run 4     0.250 ± 0.15    0.56 ± 0.24

4. CONCLUSIONS
   We have described the THU-HCSIL team's approaches to the Emotion
in Music task at MediaEval 2015. Several multi-scale approaches at
three levels were compared against the baseline system, including
acoustic feature learning, multi-scale regression fusion and
hierarchical emotion prediction. The results show that the proposed
methods are significantly better than the baseline system,
illustrating the effectiveness of the multi-scale approaches. In
future work, we plan to investigate how to select the time scale
automatically and systematically. In addition, using the audio files
of the test data in the pre-training stage of the submitted Runs 1-3
may limit the generalizability of the trained models, and further
evaluation is needed.

5. ACKNOWLEDGEMENTS
   This work was partially supported by the National Natural Science
Foundation of China (No. 61171116, 61433018) and the National High
Technology Research and Development Program of China (863 Program)
(No. 2015AA016305).
6. REFERENCES
 [1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion
     in Music task at MediaEval 2015. In Working Notes
     Proceedings of the MediaEval 2015 Workshop,
     September 2015.
 [2] Y. C. Fan, Y. Qian, F. L. Xie, and F. K. Soong. TTS
     synthesis with bidirectional LSTM based recurrent
     neural networks. In The 15th Annual Conference of
     the International Speech Communication Association
     (INTERSPEECH), 2014.
 [3] A. Graves and J. Schmidhuber. Framewise phoneme
     classification with bidirectional LSTM and other neural
     network architectures. Neural Networks,
     18(5–6):602–610, June 2005.
 [4] G. Huang, Q. Zhu, and C. Siew. Extreme learning
     machine: theory and applications. Neurocomputing,
     70(1):489–501, 2006.
 [5] N. Kumar, R. Gupta, T. Guha, and C. Vaz. Affective
     feature design and predicting continuous affective
     dimensions from music. In Working Notes Proceedings
     of the MediaEval 2014 Workshop, October 2014.
 [6] O. Lartillot and P. Toiviainen. A Matlab toolbox for
     music feature extraction from audio. In International
     Conference on Digital Audio Effects, pages 237–244,
     2007.
 [7] H. Sak, A. Senior, and F. Beaufays. Long short-term
     memory based recurrent neural network architectures
     for large vocabulary speech recognition. arXiv preprint
     arXiv:1402.1128, 2014.
 [8] L. Sun, S. Kang, K. Li, and H. Meng. Voice conversion
     using deep bidirectional long short-term memory
     based recurrent neural networks. In International
     Conference on Acoustics, Speech, and Signal
     Processing (ICASSP), pages 4869–4873, April 2015.
 [9] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro,
     and K. R. Scherer. On the acoustics of emotion in
     audio: What speech, music and sound have in
     common. Frontiers in Psychology, 4 (Article ID
     292):1–12, May 2013.
[10] M. Wöllmer, Z. X. Zhang, F. Weninger, B. Schuller,
     and G. Rigoll. Feature enhancement by bidirectional
     LSTM networks for conversational speech recognition in
     highly non-stationary noise. In International
     Conference on Acoustics, Speech, and Signal
     Processing (ICASSP), 2013.
[11] M. Xu and H. Xianyu. Heterogeneity-entropy based
     unsupervised feature learning for personality
     prediction with cross-media data. Submitted to the
     Thirtieth AAAI Conference on Artificial Intelligence,
     2016.