=Paper=
{{Paper
|id=Vol-1436/Paper60
|storemode=property
|title=Time-continuous Estimation of Emotion in Music with Recurrent Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper60.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/PellegriniB15
}}
==Time-continuous Estimation of Emotion in Music with Recurrent Neural Networks==
Thomas Pellegrini, Valentin Barrière
Université de Toulouse, IRIT, Toulouse, France
thomas.pellegrini@irit.fr

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

In this paper, we describe the IRIT approach used for the MediaEval 2015 "Emotion in Music" task. The goal was to predict two real-valued emotion dimensions, namely valence and arousal, in a time-continuous fashion. We chose to use recurrent neural networks (RNN) for their sequence modeling capabilities. Hyperparameter tuning was performed through a 10-fold cross-validation setup on the 431 songs of the development subset. With the baseline set of 260 acoustic features, our best system achieved averaged root-mean-squared errors of 0.250 and 0.238, and Pearson's correlation coefficients of 0.703 and 0.692, for valence and arousal, respectively. These results were obtained by first making predictions with an RNN comprised of only 10 hidden units; these predictions were smoothed by a moving average filter and used as input to a second RNN that generated the final predictions. This system gave our best results on the official test data subset for arousal (RMSE=0.247, r=0.588), but not for valence. Valence predictions were much worse (RMSE=0.365, r=0.029). This may be explained by the fact that, in the development subset, valence and arousal values were highly correlated (r=0.626), whereas this was not the case with the test data. Finally, slight improvements over these figures were obtained by adding spectral flatness and spectral valley features to the baseline set.

1. INTRODUCTION

Music Emotion Recognition (MER) is still a hot topic in Music Information Retrieval. In [15], the authors list four main issues that explain why MER is a challenging and very interesting scientific task: 1) ambiguity and granularity of emotion description, 2) heavy cognitive load of emotion annotation, 3) subjectivity of emotional perception, and 4) the semantic gap between low-level acoustic features and high-level human perception. MER consists of either labeling songs and music pieces as a whole, which is a classification task, or estimating emotion dimensions in continuous time and space domains, which is then a regression task applied to time series. The latter case is the objective of the current challenge. For a complete description of the task and corpus involved in the challenge, the reader may refer to [4].

For continuous-space MER, many machine learning (ML) techniques have been reported in the literature [11]. In the MediaEval 2014 challenge edition, a variety of techniques were used: simple and multi-level linear regression models [12], Support Vector Machines for regression (SVR) [9], conditional random fields [14], and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) [7]. This last approach was the one that achieved the best results. Following these results, we chose to use RNNs. All the ML models were developed using the Theano toolbox [5].

2. METHODOLOGY

In order to tune and test prediction models, we ran 10-fold cross-validation (CV) experiments on the development data subset. Once the best model was selected and tuned within this setup, a single model was trained on the whole development subset and used to generate predictions on the official evaluation data subset.

The input data were zero-mean and unit-variance normalized. Standard PCA, PCA with a Gaussian kernel, and denoising autoencoders with Gaussian noise were tested to further process the data, but no improvement was achieved with any of these techniques.

We chose to use recurrent neural networks (RNN) for their time sequence modeling capabilities. We used the Elman [8] model type, in which recurrent connections feed the hidden layer: the activations of the hidden layer at time t − 1 are stored and fed back to the same layer at time t together with the data input. A tanh activation function and a softmax function were used for the hidden layer and the final layer with two outputs (one for arousal and one for valence), respectively. The layer weights were trained with the standard root-mean-squared cost function. Weights were updated after each forward pass on a single song via the momentum update rule.
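The Elman recurrence and the per-song training step described above can be summarized as follows. This is a minimal NumPy sketch, not the authors' Theano implementation: the weight initialization, the linear output layer, and the mean-squared-error loss are illustrative assumptions (the paper reports a softmax output layer and a root-mean-squared cost).

<pre>
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 260, 10, 2   # 260 baseline features, 10 hidden units, valence and arousal

W_in  = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden weights
W_rec = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden(t-1) -> hidden(t) weights
W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output weights
b_h, b_o = np.zeros(n_hidden), np.zeros(n_out)

def forward(x_seq):
    """Run the Elman RNN over one song; x_seq has shape (T, n_in)."""
    h = np.zeros(n_hidden)                           # hidden state fed back at every step
    y = []
    for x_t in x_seq:
        h = np.tanh(W_in @ x_t + W_rec @ h + b_h)    # recurrent hidden layer
        y.append(W_out @ h + b_o)                    # per-frame (valence, arousal) estimate
    return np.array(y)

def mse(pred, target):
    """Mean squared error over all frames and both dimensions."""
    return np.mean((pred - target) ** 2)

# Momentum update rule, applied after each forward/backward pass on one song,
# where grad is obtained by backpropagation through time:
#   velocity <- mu * velocity - learning_rate * grad
#   weights  <- weights + velocity
</pre>

With 260 inputs, 10 hidden units and 2 outputs, such a network has fewer than 3,000 weights.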
The hyperparameters were tuned within the 10-fold CV setup. The best model was comprised of 10 hidden units, trained with a 1.0×10⁻³ learning rate and a 1.0×10⁻² regularization coefficient applied to both the L1 and L2 norms. To further limit overfitting, an early stopping strategy was used: all models were trained for only 50 iterations. This number of iterations was set empirically.

A moving average filter was used to smooth the predictions. Its size was tuned in the 10-fold CV setup, and the best one was a window of 13 points. To avoid unwanted border effects, the first and last 6 points, corresponding to the filter delay, were set equal to the unfiltered predictions.

Another post-processing step was also tested: feeding a second RNN with the predictions of the first RNN. Inspection of the output shows that the second RNN further smoothed the predictions.
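The two post-processing steps can be sketched as follows in NumPy, under the settings above (a 13-point window, with the first and last 6 points left unfiltered). The second-stage call is only indicative, since the paper does not detail the architecture of the second RNN.

<pre>
import numpy as np

def smooth_predictions(pred, window=13):
    """Moving-average smoothing of a (T, 2) prediction sequence.
    The first and last window // 2 points (6 points here) are kept
    unfiltered to avoid border effects, as described above."""
    half = window // 2
    kernel = np.ones(window) / window
    smoothed = pred.copy()
    for dim in range(pred.shape[1]):                       # valence, arousal
        filtered = np.convolve(pred[:, dim], kernel, mode="same")
        smoothed[half:-half, dim] = filtered[half:-half]   # keep the borders unfiltered
    return smoothed

# Second post-processing step: the smoothed curves become the 2-dimensional
# input sequence of a second RNN (forward_second_rnn is a hypothetical helper):
# final = forward_second_rnn(smooth_predictions(first_rnn_predictions))
</pre>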
The challenge rules also allowed participants to use their own acoustic features. To complement the 260 baseline features, a set of 29 acoustic feature types was extracted with the ESSENTIA toolbox [6], a toolbox specifically designed for Music Information Retrieval. The 29 feature types, some of which (such as the Bark and ERB band features) span several frequency bands, resulted in a total of 196 real values per audio frame. The same frame rate as the baseline features was used (0.5 s window duration and hop size). We chose the feature types from a large list, mainly from the spectral domain, such as the so-called spectral "contrast", "valley", and "complexity" features, but also a few features from the time domain, such as "danceability". For a complete list and description of the available feature extraction algorithms, the reader may refer to the ESSENTIA API documentation Web page [2].

In order to select useful features, we tested each feature type by adding it, one at a time, to the baseline feature set. Only three feature types were found to improve the baseline CV performance: two variants of spectral flatness and a feature called "spectral valley". The two spectral flatness features use two different frequency scales: the Bark and the Equivalent Rectangular Bandwidth (ERB) scales. 25 Bark bands were used, as computed in ESSENTIA [1, 3]. The ERB scale consists of applying a frequency-domain filterbank using gammatone filters [13]. Spectral flatness provides a way to quantify how noise-like a sound is, as opposed to being tone-like. Spectral valley is a feature derived from the so-called spectral contrast feature, which represents the relative spectral distribution [10]. This feature was shown to perform better than Mel frequency cepstral coefficients in the task of music type classification [10].
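As an illustration, the retained feature types could be computed per frame with ESSENTIA's Python bindings roughly as follows. This is a sketch only: the frame settings and algorithm parameters are assumptions and should be checked against the ESSENTIA reference documentation [2]; the paper itself only specifies 0.5 s windows and 25 Bark bands.

<pre>
# Illustrative per-frame extraction of Bark/ERB spectral flatness and spectral valley.
# Frame size, hop size and default parameters are assumptions, not the paper's exact setup.
import essentia.standard as es

audio = es.MonoLoader(filename="song.mp3", sampleRate=44100)()

window            = es.Windowing(type="hann")
spectrum          = es.Spectrum()
bark_bands        = es.BarkBands(numberBands=25)          # 25 Bark bands, as in [1, 3]
erb_bands         = es.ERBBands()                         # gammatone filterbank on the ERB scale [13]
flatness_db       = es.FlatnessDB()                       # flatness of a band-energy vector
spectral_contrast = es.SpectralContrast(frameSize=2048)   # returns (contrast, valley) per sub-band [10]

features = []
for frame in es.FrameGenerator(audio, frameSize=2048, hopSize=1024):
    spec = spectrum(window(frame))
    flat_bark = flatness_db(bark_bands(spec))              # spectral flatness, Bark scale
    flat_erb  = flatness_db(erb_bands(spec))               # spectral flatness, ERB scale
    contrast, valley = spectral_contrast(spec)             # spectral valley feature
    features.append((flat_bark, flat_erb, valley))
</pre>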
3. RESULTS

Results are shown in Table 1 for both the cross-validation experiments and the runs on the official evaluation test data subset, referred to as 'CV' and 'Eval', respectively. The results are reported in terms of root-mean-squared error (RMSE) and Pearson's correlation coefficient (r).

{| class="wikitable"
|+ Table 1: 10-fold cross-validation (CV) and official evaluation test (Eval) results. lr: linear regression model, BSL: baseline results provided by the organizers, rnn: RNN, rnn2: RNN fed with the predictions of the first RNN.
|-
! rowspan="2" | System
! colspan="2" | CV Valence
! colspan="2" | CV Arousal
! colspan="2" | Eval Valence
! colspan="2" | Eval Arousal
|-
! RMSE !! r !! RMSE !! r !! RMSE !! r !! RMSE !! r
|-
| lr, 260 feat. || .275 || .637 || .254 || .646 || N/A || N/A || N/A || N/A
|-
| BSL, 260 feat. || N/A || N/A || N/A || N/A || .366 ± .18 || .01 ± .38 || .27 ± .11 || .36 ± .26
|-
| rnn, 260 feat. || .261 || .675 || .246 || .670 || .377 ± .181 || .017 ± .420 || .259 ± .112 || .518 ± .238
|-
| + smoothing || .254 || .694 || .239 || .689 || .365 ± .188 || .029 ± .476 || .247 ± .116 || .588 ± .235
|-
| + rnn2 || .250 || .703 || .238 || .692 || N/A || N/A || N/A || N/A
|-
| rnn, 268 feat. || .259 || .678 || .245 || .673 || .373 ± .180 || .023 ± .422 || .254 ± .106 || .532 ± .224
|-
| + smoothing || .252 || .697 || .238 || .692 || .361 ± .187 || .044 ± .487 || .243 ± .111 || .612 ± .216
|-
| + rnn2 || .249 || .706 || .238 || .694 || .371 ± .194 || .044 ± .525 || .244 ± .115 || .635 ± .222
|}

Generally speaking, valence predictions were less accurate than the arousal ones, unlike the performance results reported in the 2014 edition, as reported in [7], for example. Concerning the CV results, the simple linear regression model (lr) was outperformed by the RNN model with the baseline 260 features: RMSE values of 0.275 (lr) vs. 0.261 (rnn) for valence, and 0.254 vs. 0.246 for arousal. Since the number of runs was limited, we did not submit predictions with lr on Eval. As expected, this shows that the sequential modeling capabilities of the RNN are useful for this task. Adding the extra 8 features brought a slight improvement (rnn, 268 feat.). Smoothing the network predictions brought further improvement, using either 260 or 268 features as input. Finally, using the predictions as input to a second RNN brought a slight improvement too. The best system achieved averaged RMSE of 0.250 and 0.238, and Pearson's correlation coefficients of 0.703 and 0.692, for valence and arousal, respectively.

Concerning the Eval results, this system also gave the best results on the official test data subset, but only for arousal (RMSE=0.247, r=0.588). Valence predictions were much worse (RMSE=0.365, r=0.029). This may be explained by the fact that, in the development subset, valence and arousal values were highly correlated (r=0.626), and this was not the case with the test data, as hypothesized by the challenge organizers. This performance discrepancy was also observed by the organizers, who provided baseline results ('BSL') using a linear regression model [4]. Only our best three arousal predictions outperformed the BSL results significantly.

4. CONCLUSIONS

In this paper, we described our experiments using RNNs for the 2015 MediaEval Emotion in Music task. As expected, the sequence modeling capabilities of RNNs proved useful for this task, since basic linear regression models were outperformed in our cross-validation experiments. Prediction smoothing also proved useful. The best results were obtained when feeding smoothed predictions into a second RNN, for both valence and arousal in our CV experiments, and only for arousal on the official test set. The observed performance discrepancy between the valence and arousal variables may be due to differences between the development and test data: valence and arousal values were highly correlated in the development dataset, and much less so in the test data set. Concerning the acoustic feature set, slight improvements were obtained by adding spectral flatness and spectral valley features to the baseline feature set. As future work, we plan to further explore denoising autoencoders and LSTM-RNNs, since our first experiments with these models did not show improvement compared to basic RNNs.

5. REFERENCES

[1] The Bark frequency scale. http://ccrma.stanford.edu/~jos/bbt/Bark_Frequency_Scale.html. Accessed: 2015-08-24.

[2] ESSENTIA algorithm documentation web page. http://essentia.upf.edu/documentation/algorithms_reference.html. Accessed: 2015-08-24.

[3] The ESSENTIA BarkBands documentation page. http://essentia.upf.edu/documentation/reference/std_BarkBands.html. Accessed: 2015-08-24.

[4] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.

[5] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proc. of the Python for Scientific Computing Conference (SciPy), volume 4, page 3, Austin, 2010.

[6] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, et al. ESSENTIA: an audio analysis library for Music Information Retrieval. In Proc. International Society for Music Information Retrieval Conference (ISMIR'13), pages 493–498, Curitiba, 2013.

[7] E. Coutinho, F. Weninger, B. Schuller, and K. Scherer. The Munich LSTM-RNN approach to the MediaEval 2014 "Emotion in Music" task. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, 2014.

[8] J. Elman. Finding structure in time. Cognitive Science, 14(2), 1990.

[9] V. Imbrasaite and P. Robinson. Music emotion tracking with continuous conditional neural fields and relative representation. 2014.

[10] D. Jiang, L. Lu, H.-J. Zhang, J.-H. Tao, and L.-H. Cai. Music type classification by spectral contrast feature. In Proc. ICME, volume 1, pages 113–116, Lausanne, 2002.

[11] Y. Kim, E. Schmidt, R. Migneco, B. Morton, P. Richardson, J. Scott, J. Speck, and D. Turnbull. Music emotion recognition: a state of the art review. In 11th International Society for Music Information Retrieval Conference, Utrecht, 2010.

[12] N. Kumar, R. Gupta, T. Guha, C. Vaz, M. Van Segbroeck, J. Kim, and S. Narayanan. Affective feature design and predicting continuous affective dimensions from music. In MediaEval Workshop, Barcelona, 2014.

[13] B. C. Moore and B. R. Glasberg. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. Journal of the Acoustical Society of America, 74(3):750–753, 1983.

[14] W. Yang, K. Cai, B. Wu, Y. Wang, X. Chen, D. Yang, and A. Horner. Beatsens' solution for MediaEval 2014 Emotion in Music task. 2014.

[15] Y.-H. Yang and H. H. Chen. Music Emotion Recognition. CRC Press, 2011.