-

The TUM Approach to the MediaEval Music Emotion Task Using Generic Affective Audio Features

Felix Weninger, Florian Eyben

eyben@tum.de weninger@tum.de {weninger,eyben}@tum.de 1

Björn Schuller

schuller@IEEE.org 0 0 Department of Computing, Imperial College London , London SW7 2AZ , UK 1 Machine Intelligence & Signal Processing Group, MMK, Technische Universität München , 80290 Munich , Germany

2013

18 19

include mean, moments, quartiles, 1- and 99-percentiles, as well as This paper describes the TUM approach for the MediaEval contour related measurements such as (relative) rise and fall times, Emotion in Music task which consists of non-prototypical amplitudes and standard deviations of local maxima ('peaks'), and music retrieved from the web, annotated by crowdsourcing. linear and quadratic regression coefficients. An exhaustive list of We use Support Vector Machines and BLSTM recurrent the LLDs and functionals along with a detailed analysis of feature neural networks for static and dynamic arousal and valence relevance for music mood regression is found in [3]. Extraction regression. A generic set of acoustic features is used that of acoustic features is done with our open-source toolkit openShas been proven e ective for a ect prediction across multiple MILE [4] which can be used 'out-of-the-box' to extract the Comdomains. In the result, the best models explain 64 and 48 % ParE set, so that our features can be reproduced by the interested of the annotations' variance for arousal and valence in the reader. Prior to feature extraction, songs are normalized to -3 dB static case, and an average Kendall's tau with the songs' maximum amplitude using 'sox'. This is done to remove noise in emotion contour of .18 and .12 is achieved in the dynamic energy-related features and improve generalization. case. As regressors, we use Support Vector Regression (SVR) for songlevel regression and bidirectional Long Short-Term Memory recurrent neural networks (BLSTM-RNNs) for dynamic regression. 1. INTRODUCTION Both use the same input features, normalized to the range [ 1; +1] The 2013 MediaEval 'Emotion in Music' task is to provide con- for SVR and standardized to zero mean and unit variance (on the tinuous valued arousal and valence estimates both for whole songs training data) for BLSTM-RNNs. Separate SVR models are trained (static) and sequences of one second long segments (dynamic). For for arousal and valence regression while BLSTM-RNNs learn both details on the task, we refer to the paper describing the task [1]. In arousal and valence prediction in a multi-task learning fashion. For the following we describe our approach. BLSTM-RNNs, the regression targets are standardized as well. In addition, we investigate adding delta regression coefficients of the arousal and valence targets as additional regression tasks, in or2. METHOD der to improve modeling of the dynamic emotion profile. The Our approach is based on supra-segmental features calculated complexity constant for SVR training was varied from 10 4 to by applying statistical functionals, such as mean and moments, to 10 1. BLSTM-RNNs with two hidden layers (128 LSTM units per the contours of frame-wise low-level descriptors (LLDs), such as layer and direction) are used; thus, the first layer performs inforMFCCs or energy, over either fixed length segments (one second, mation reduction to a 128-dimensional feature set. The segments corresponding to the annotated intervals in the corpus) or whole of each song are processed in order, forming sequences. Gradisongs. In particular, we use the set of affective features developed ent descent with 25 sequences per weight update is used for trainas baseline for the 2013 Computational Paralinguistics Evaluation ing. An early stopping strategy is used, using a held out valida(ComParE) campaign [2]. It has been shown in [3] that this set tion set in each fold. Training is stopped after a maximum of 100 provides robust cross-domain assessment of emotion (continuous iterations or after 20 iterations without improving the validation arousal and valence) in speech, music, and acoustic events. Despite set error (sum of squared errors). To alleviate over-fitting to the its rather 'brute-force' nature, it has been shown to outperform a high dimensional input feature set, Gaussian noise with zero mean more hand-crafted set of musically motivated features for the task and standard deviation 0.6 is added to the input activations, and of music mood regression. sequences are presented in random order during training. SVR The ComParE feature set contains 6 373 features. LLDs include models are trained with Weka [5] using Sequential Minimal Optiauditory weighted frequency bands, their sum (corresponding to mization (SMO). BLSTM-RNNs are trained with our open-source loudness), spectral measures such as centroid, roll-of point, skew- CUda RecuRrent Neural Network Toolkit (CURRENNT)1 for furness, sharpness, and spectral flux. Furthermore, voicing related ther reproducibility. All hyper-parameters not mentioned in the LLDs such as fundamental frequency (corresponding to 'main melody') above are left at the toolkits' defaults. and harmonics-to-noise ratio (corresponding to 'percussiveness') are added. Delta regression coefficients (weighted discrete derivatives) are added to capture time dynamics. Statistical functionals 3. RESULTS

1https://sourceforge.net/p/currennt (a) Song level, SVR (b) Segment level, BLSTM (c) Song level, BLSTM (average segment level predictions)

Tasks

A+V A+V+

Arousal

R2 MLE .081 .080

Valence

R2 MLE .087 .088 set. Evaluation measures are computed on the entire development set (not by averaging across folds). The fold subdivision follows a simple modulo based scheme (song ID modulo 10), and is thus easily reproducible and song independent (in the case of regression on segments). We report the official challenge metrics, determination coefficient (R2) for whole song regression and average Kendall’s per song ( ) for segment regression, along with mean linear error (MLE). MLE is calculated after scaling the annotations to the range [ 0:5; +0:5]. On segment level, we also report R2 (across all segments) to assess the overall regression performance without taking into account the modeling of the emotional profile of a song.

In short, we observe that (a) SVR performance is very sensitive to the complexity parameter; (b) R2 on segment level is very high compared to , indicating the difficulty of estimating the dynamics of the annotation contour within a song instead of the overall emotion; (c) adding deltas to the regression targets improves for arousal, but not valence prediction; (d) best song level results in terms of R2 are obtained by averaging BLSTM predictions, outperforming SVR by a large margin for valence (.499 vs. .419). In the following the configurations for our test set runs are summarized.

Static task (song level): 1. SVR: SVR with C = 10 3, trained on the entire development set 2. BLSTM-PA-Song: BLSTM-RNNs trained on the 10 training folds of the development set; segment level predictions averaged within songs and across networks 3. BLSTM-WA-Song: BLSTM-RNN trained on the 10 training folds of the development set by weight averaging; segment level predictions averaged within songs Dynamic task (segment level): 1. BLSTM-PA-Seg: BLSTM-RNNs trained on the 10 training folds of the development set; predictions averaged across networks (a) Song level (`Static task')

Run name

SVR .646

BLSTM-PA-Song .642 BLSTM-WA-Song .643 Arousal

R2 MLE .083 .085 .085 (b) Segment level (`Dynamic task')

Run name BLSTM-PA-Seg BLSTM-WA-Seg .180 .174 Arousal

2. BLSTM-WA-Seg: BLSTM-RNNs trained on the 10 training folds of the development set by weight averaging To deliver BLSTM predictions on the test set, we either average the predictions of the 10 networks trained on the development set (PA), or average their weights and run additional training iterations on the entire development set (WA).

Table 2 shows that BLSTM-RNNs outperform SVR on the song level for valence while being on par for arousal. This is consistent with the development set results. On the segment level, the WA strategy delivers slightly worse results in terms of than PA while using a 10 times smaller model. 4.

CONCLUSION

We have presented the TUM approach to the 2013 MediaEval Emotion in Music task. Best results on the static (song level) task were obtained by averaging time-varying predictions of a BLSTMRNN. BLSTM-RNNs also delivered consistent improvements over the baseline in the dynamic task. 5.

[1]

Soleymani ,

Caro ,

E. M.

Schmidt , C.-Y. Sha, and Y. -H. Yang , “ 1000 songs for emotional analysis of music,” in Proc. of CrowdMM (held in conjunction with ACM MM) . Barcelona, Spain: ACM, 2013 , to appear.

[2]

Schuller ,

Steidl ,

Batliner ,

Vinciarelli ,

Scherer ,

Ringeval ,

Chetouani et al., “The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals , Conflict, Emotion, Autism,” in Proc. of INTERSPEECH. Lyon , France: ISCA, 2013 , pp. 148 - 152 .

[3]

Weninger ,

Eyben ,

B. W.

Schuller ,

Mortillaro , and

K. R.

Scherer , “ On the Acoustics of Emotion in Audio: What Speech, Music and Sound have in Common,” Frontiers in Emotion Science , vol. 4 , no. Article ID 292 , pp. 1 - 12 , May 2013 .

[4]

Eyben ,

Weninger ,

Groß , and

Schuller , “ Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor,” in Proc. of ACM MM. Barcelona , Spain: ACM, October 2013 , 4 pages, to appear.

[5]

Hall , E. Frank,

Holmes ,

Pfahringer ,

Reutemann ,

and I. H.

Witten , “ The WEKA data mining software: an update , ” ACM SIGKDD Explorations Newsletter , vol. 11 , no. 1 , pp. 10 - 18 , 2009 .