=Paper=
{{Paper
|id=Vol-1436/Paper77
|storemode=property
|title=Multi-Scale Approaches to the MediaEval 2015 ``Emotion in Music" Task
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper77.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/XuLXTMC15
}}
==Multi-Scale Approaches to the MediaEval 2015 "Emotion in Music" Task==
Multi-scale Approaches to the MediaEval 2015 "Emotion in Music" Task

Mingxing Xu, Xinxing Li, Haishu Xianyu, Jiashen Tian, Fanhang Meng, Wenxiao Chen
Key Laboratory of Pervasive Computing, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology, Tsinghua University, Beijing, China
xumx@tsinghua.edu.cn, {lixinxing1991, xyhs2010}@126.com

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT

The goal of the "Emotion in Music" task in MediaEval 2015 is to automatically estimate the emotions expressed by music (in terms of Arousal and Valence) in a time-continuous fashion. In this paper, considering the high context correlation within the music feature sequence, we study several multi-scale approaches at different levels, including acoustic feature learning with Deep Belief Networks (DBNs) followed by a modified Autoencoder (AE), multi-scale regression fusion based on bi-directional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) and an Extreme Learning Machine (ELM), and hierarchical prediction with Support Vector Regression (SVR). The evaluation performances of all submitted runs are significantly better than the baseline provided by the organizers, illustrating the effectiveness of the proposed approaches.

1. INTRODUCTION

The MediaEval 2015 "Emotion in Music" challenge has only one task, dynamic emotion characterization, comprising two required runs (one for feature extraction with linear regression, the other for a regression model with the baseline feature set provided by the organizers) and up to three additional runs (any combination of features and machine learning techniques) to permit a thorough comparison between different methods. In this year's task, the development data contains 431 clips with the best annotation agreement selected from last year's data, and the evaluation data consists of 58 full-length songs. For more details, please refer to [1].

In order to predict and trace the evolution of music emotion more precisely, we investigated several multi-scale methods implemented at three different levels: the acoustic feature level, the regression model level and the emotion annotation level. At the acoustic feature level, features were organized in groups according to their time scales and fundamentals, and a deep learning algorithm was used to learn new features integrating multi-scale information about music emotion. At the regression model level, inspired by the capability of BLSTM-RNNs to map sequence to sequence [3], we trained several BLSTM-RNNs with different sequence lengths and fused them with an extreme learning machine to produce the final prediction. In addition, we proposed a hierarchical regression scheme to predict the global trend and the local fluctuation of dynamic music emotion separately.

2. METHODOLOGY

2.1 Feature Learning

We used the openSMILE toolbox to extract 65 Low-Level Descriptors (LLDs) with the IS13_ComParE_lld configuration (see [9] for details) and divided them into three groups: A) 26 LLDs related to audSpec; B) 29 LLDs related to pcm_fftMag and Mel-Frequency Cepstral Coefficients (MFCCs); C) 10 LLDs related to voice. In addition, we adopted the idea proposed in [5] to extract Compressibility (comp), Spectral Centre of Mass (SCOM) and Median Spectral Band Energy (MSBE) at the local scale, and used the MIR Toolbox [6] to extract 20 further features related to music attributes, including dynamic RMS energy, Tempo, Event Density, Spectrum centroid, Flatness, Irregularity, Skewness, Kurtosis, Rolloff85, Rolloff95, Spread, Brightness, Roughness, Entropy, Spectral Flux, Zero crossing rate, HCDF, Key mode, Key clarity and Chromagram centroid; these were assembled as group D. The frame size was 60 ms for group C and 25 ms for the other groups. In all groups, overlapping windows were used with a 10 ms step.

For the features of each group, within a 1 s window with 0.5 s overlap, we calculated the mean, standard deviation (STD), slope and Shannon entropy functionals, the delta coefficients together with their STD and slope functionals, and the acceleration coefficients together with their STD functionals (a code sketch of this computation is given at the end of this subsection). This resulted in four feature sets of dimension 182, 203, 70 and 161, respectively.

Four different Deep Belief Networks (DBNs) were used to learn a higher-level representation for each feature group independently; these representations were then fused by a special Autoencoder with a modified cost function considering sparsity and heterogeneous entropy (details described in [11]), producing the final features at a rate of 2 Hz for the subsequent regression.
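As a concrete illustration of the windowed functional extraction described above, the Python sketch below computes the mean, STD, slope and Shannon entropy functionals over 1 s windows with a 0.5 s shift, together with the STD and slope of the delta coefficients. This is a minimal sketch, not the authors' code: it assumes the LLDs have already been extracted at a 10 ms step (e.g. with openSMILE's IS13_ComParE_lld configuration) and loaded as a NumPy array of shape (n_frames, n_llds); the function names and the entropy binning are illustrative choices, and the acceleration-coefficient functionals would be handled analogously.

<pre>
# Sketch (not the authors' code): windowed functionals over pre-extracted LLD frames.
import numpy as np

FRAME_STEP = 0.010   # 10 ms LLD step
WIN_LEN = 1.0        # 1 s functional window
WIN_SHIFT = 0.5      # 0.5 s shift -> 2 Hz feature rate

def shannon_entropy(x, bins=16):
    """Histogram-based Shannon entropy of one LLD contour within a window."""
    hist, _ = np.histogram(x, bins=bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def slope(x):
    """Least-squares slope of an LLD contour over the window."""
    t = np.arange(len(x))
    return float(np.polyfit(t, x, 1)[0])

def window_functionals(seg):
    """Mean, STD, slope and Shannon entropy for every LLD column of one window."""
    feats = []
    for col in seg.T:
        feats += [col.mean(), col.std(), slope(col), shannon_entropy(col)]
    return np.array(feats)

def segment_features(lld):
    """Apply the functionals to 1 s windows with 0.5 s shift (2 Hz output)."""
    win = int(WIN_LEN / FRAME_STEP)
    hop = int(WIN_SHIFT / FRAME_STEP)
    delta = np.diff(lld, axis=0, prepend=lld[:1])   # delta coefficients
    rows = []
    for start in range(0, len(lld) - win + 1, hop):
        seg, dseg = lld[start:start + win], delta[start:start + win]
        rows.append(np.concatenate([
            window_functionals(seg),
            # STD and slope functionals of the delta coefficients
            [f for col in dseg.T for f in (col.std(), slope(col))],
        ]))
    return np.vstack(rows)
</pre>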
2.2 Multi-scale BLSTM-RNNs Fusion

2.2.1 Model training

Considering the high context correlation within the music emotion feature sequence, we used bi-directional Long Short-Term Memory recurrent neural networks (BLSTM-RNNs), which have worked quite well on numerous sequence modeling tasks in recent years [10, 7, 2, 8], to predict dynamic music emotion.

Separate BLSTM-RNNs were trained for arousal and valence regression. BLSTM-RNNs with 5 hidden layers (250 units per layer and direction) were used. The first two layers were pre-trained on the whole development set (431 clips) and the test set (58 songs). Training with a learning rate of 5E-6 was stopped after a maximum of 100 iterations, or after 20 iterations without improvement of the validation set error. To alleviate over-fitting, Gaussian noise with zero mean and standard deviation 0.6 was added to the input activations, and sequences were presented in random order during training. All BLSTM-RNNs were trained with CURRENNT (https://sourceforge.net/p/currennt).

We trained four kinds of BLSTM-RNNs with different time scales (i.e. sequence lengths) of 60, 30, 20 and 10, respectively, on a training set containing 411 clips, and validated them on the remaining 20 clips, selected randomly according to the genre distribution of the test data (i.e. the 58 complete songs). In total we made 5 different data partitions (411 clips for training, 20 clips for validation) and ran 3 trials of each model with randomized initial weights, among which the best one was selected. Hence, there were 5 different BLSTM-RNNs for each time scale.

2.2.2 Model selection and fusion

In order to select 4 models with different time scales for fusion, we applied two different criteria separately to compose two groups of 4 models. The first criterion was RMSE-first, which simply selected the model with the best RMSE for each time scale; the second criterion considered both the RMSE and the data partition, to guarantee that the training sets of the models selected for fusion differed from each other. In our experiments, 2 models were shared by the two groups; in other words, there were 6 unique models for fusion.

At the fusion step, we averaged the predictions produced by all 6 models as the final result. In addition to this simple fusion policy, we trained an Extreme Learning Machine (ELM) [4] for fusion, as sketched below. The input feature vector of the ELM consisted of the original predictions of the BLSTM-RNNs at the 4 different time scales, their delta derivatives and the smoothed values generated by a triangle filter of length 50. Two separate ELMs were constructed to fuse the corresponding predictions of the two model groups mentioned above. Finally, the outputs of the two ELMs were averaged to produce the final emotion prediction.
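The ELM fusion step can be sketched as follows: the code builds the ELM input described above (the per-time-scale predictions, their deltas and triangle-filter smoothed values) and trains a minimal single-hidden-layer ELM with a random projection and ridge-regularized output weights. This is an assumed illustration rather than the authors' implementation; the array `preds`, the hidden-layer size, the regularization constant and the random seed are all hypothetical.

<pre>
# Sketch (illustrative, not the authors' implementation) of the ELM-based fusion.
# `preds` holds the aligned predictions of the four time-scale BLSTM-RNNs for one
# emotion dimension, shape (n_frames, 4).
import numpy as np

def elm_input(preds, filt_len=50):
    """Original predictions, their deltas, and triangle-filter smoothed values."""
    delta = np.diff(preds, axis=0, prepend=preds[:1])
    tri = np.bartlett(filt_len)
    tri /= tri.sum()
    smoothed = np.column_stack(
        [np.convolve(preds[:, k], tri, mode="same") for k in range(preds.shape[1])])
    return np.hstack([preds, delta, smoothed])          # (n_frames, 12)

class ELM:
    """Minimal single-hidden-layer ELM: random projection + ridge output weights."""
    def __init__(self, n_hidden=200, ridge=1e-3, seed=0):
        self.n_hidden, self.ridge = n_hidden, ridge
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        A = H.T @ H + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)          # closed-form output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Usage (hypothetical arrays):
# fused = ELM().fit(elm_input(train_preds), train_targets).predict(elm_input(test_preds))
</pre>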
2.3 Hierarchical Regression

The aim of hierarchical regression is to predict the global trend and the local fluctuation of dynamic music emotion separately. First, a global Support Vector Regression (SVR) model was built to predict the mean of the dynamic emotion attributes of the whole song, using 6373 song-level global features extracted with the openSMILE toolbox and the IS13_ComParE configuration (see [9] for details). Then, openSMILE with the IS13_ComParE_lld configuration was used to extract 130 segment-level features, whose means and standard deviations were calculated with a 1 s window and a 0.5 s shift to form local features; these were used by a local SVR to predict the fluctuation of the dynamic emotion attributes for each 0.5 s clip. Finally, for each 0.5 s clip, the fluctuation value predicted by the local SVR and the mean value predicted by the global SVR were added to form the final emotion prediction, as illustrated in the sketch below.
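A minimal sketch of this hierarchical scheme, assuming scikit-learn's SVR and pre-computed song-level and segment-level features, is given below; the variable names, the RBF kernel and the song-to-segment index mapping are illustrative assumptions, not details taken from the paper.

<pre>
# Sketch (assumed implementation) of the two-level hierarchical regression.
# song_feats:  (n_songs, 6373) song-level global features
# seg_feats:   (n_segments, d) local features per 0.5 s clip
# seg_labels:  (n_segments,)   dynamic emotion annotations per 0.5 s clip
# song_idx:    (n_segments,)   integer index of the song each segment belongs to
import numpy as np
from sklearn.svm import SVR

def fit_hierarchical(song_feats, seg_feats, seg_labels, song_idx):
    # Global SVR: predict the per-song mean of the dynamic emotion attribute.
    song_means = np.array([seg_labels[song_idx == s].mean()
                           for s in range(len(song_feats))])
    global_svr = SVR(kernel="rbf").fit(song_feats, song_means)

    # Local SVR: predict the fluctuation around the song mean for each 0.5 s clip.
    fluct = seg_labels - song_means[song_idx]
    local_svr = SVR(kernel="rbf").fit(seg_feats, fluct)
    return global_svr, local_svr

def predict_hierarchical(global_svr, local_svr, song_feats, seg_feats, song_idx):
    # Final prediction = predicted song mean + predicted local fluctuation.
    mean_pred = global_svr.predict(song_feats)
    return mean_pred[song_idx] + local_svr.predict(seg_feats)
</pre>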
3. RUNS AND EVALUATION RESULTS

We submitted four runs for the task this year. The specifics of each run are as follows. Run 1) Multi-scale BLSTM-RNNs based fusion with the simple averaging policy, performed with the baseline feature set. Run 2) Same as Run 1, but using ELMs for fusion. Run 3) Same as Run 1, but using the new features learnt with the method described in Section 2.1. In all of the above runs, the test data was segmented into fixed-length clips with 50% overlap according to the time scale of the BLSTM-RNNs used. Run 4) The SVR-based hierarchical regression described in Section 2.3.

Table 1: Official evaluation results on the test data.

Dimension  System    RMSE           r
Valence    Baseline  0.366 ± 0.18   0.01 ± 0.38
Valence    Run 1     0.331 ± 0.18   0.12 ± 0.54
Valence    Run 2     0.308 ± 0.17   0.15 ± 0.47
Valence    Run 3     0.349 ± 0.19   0.02 ± 0.51
Valence    Run 4     0.303 ± 0.19   0.01 ± 0.40
Arousal    Baseline  0.270 ± 0.11   0.36 ± 0.26
Arousal    Run 1     0.230 ± 0.11   0.66 ± 0.25
Arousal    Run 2     0.234 ± 0.11   0.63 ± 0.27
Arousal    Run 3     0.240 ± 0.12   0.52 ± 0.37
Arousal    Run 4     0.250 ± 0.15   0.56 ± 0.24

In Table 1, we report the official evaluation metrics (r, the Pearson correlation coefficient, and RMSE, the Root Mean Squared Error). All runs were significantly better than the baseline. Considering the overall performance, Run 2 was the best one. However, Run 2 was not consistently better than Run 1, which indicates that the ELMs might have been trained insufficiently. Both Runs 1 and 2 performed particularly well in terms of r, which we attribute to the BLSTM-RNNs' capability of mapping sequence to sequence. The reason why the new features in Run 3 did not bring the expected improvement might be that the low-level features were not appropriate for representing the different time scales. Although the method in Run 4 is simple, it delivered RMSE and r for Arousal comparable to the other runs and performed quite well for Valence, though only in RMSE and not in r, which may be related to the decomposition into global trend and local fluctuation. We believe it is a promising algorithm.

4. CONCLUSIONS

We have described the THU-HCSIL team's approaches to the Emotion in Music task at MediaEval 2015. Several multi-scale approaches at three levels were compared with the baseline system, including acoustic feature learning, multi-scale regression fusion and hierarchical emotion prediction. The results show that the proposed methods are significantly better than the baseline system, illustrating the effectiveness of the multi-scale approaches. In future work, we plan to investigate how to select the time scales automatically and systematically. In addition, the use of the audio files of the test data in the pre-training stage of the submitted Runs 1-3 may limit the generalizability of the trained models, and further evaluation is needed.

5. ACKNOWLEDGEMENTS

This work was partially supported by the National Natural Science Foundation of China (No. 61171116, 61433018) and the National High Technology Research and Development Program of China (863 Program) (No. 2015AA016305).

6. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[2] Y. C. Fan, Y. Qian, F. L. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In The 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[3] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610, June 2005.
[4] G. Huang, Q. Zhu, and C. Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489–501, 2006.
[5] N. Kumar, R. Gupta, T. Guha, and C. Vaz. Affective feature design and predicting continuous affective dimensions from music. In Working Notes Proceedings of the MediaEval 2014 Workshop, October 2014.
[6] O. Lartillot and P. Toiviainen. A Matlab toolbox for music feature extraction from audio. In International Conference on Digital Audio Effects, pages 237–244, 2007.
[7] H. Sak, A. Senior, and F. Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128, 2014.
[8] L. Sun, S. Kang, K. Li, and H. Meng. Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 4869–4873, April 2015.
[9] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer. On the acoustics of emotion in audio: What speech, music and sound have in common. Frontiers in Psychology, 4 (Article ID 292):1–12, May 2013.
[10] M. Wollmer, Z. X. Zhang, F. Weninger, B. Schuller, and G. Rigoll. Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.
[11] M. Xu and H. Xianyu. Heterogeneity-entropy based unsupervised feature learning for personality prediction with cross-media data. Submitted to The Thirtieth AAAI Conference on Artificial Intelligence, 2016.