Music Emotion Tracking with Continuous Conditional Neural Fields and Relative Representation

Vaiva Imbrasaitė (Vaiva.Imbrasaite@cl.cam.ac.uk)
Peter Robinson (Peter.Robinson@cl.cam.ac.uk)
Computer Laboratory, University of Cambridge, United Kingdom

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
This working notes paper introduces the system proposed by the Rainbow group for the MediaEval Emotion in Music 2014 task. The task is concerned with predicting dynamic emotion labels for an excerpt of a song. Our approach uses Continuous Conditional Neural Fields and relative feature representation, both of which have been developed or adapted by our group.

1. INTRODUCTION
The Emotion in Music task is concerned with providing dynamic arousal and valence labels and is described in the paper by Aljanaki et al. [1].

The use of relative feature representation has already been introduced to the field of dynamic music annotation and tested on the MoodSwings dataset [4] by Imbrasaitė et al. [2]. They showed a substantial improvement over standard feature representation with the standard Support Vector Regression (SVR) approach, as well as performance comparable to more complicated machine learning techniques such as Continuous Conditional Random Fields (CCRF).

Continuous Conditional Neural Fields (CCNF) have also been used for dynamic music annotation by Imbrasaitė et al. [3]. In those experiments, CCNF clearly outperformed SVR when using the standard feature representation, and produced similar results to SVR when using the relative feature representation. We suspected that the short extracts (only 15 s) and the small variation in emotion within them were the main reasons why the model was not able to achieve better results. In this paper we apply the same techniques to a dataset that improves on both accounts, in the hope of clearer results.

2. METHOD

2.1 Feature extraction and representation
Our system used two feature sets. Both were extracted by OpenSMILE using a standard set of features. As CCNF can suffer, and even fail to converge, when dealing with a large feature vector, we used a limited set of statistical descriptors extracted from the features, limiting the total number of features to 150.

The first feature set was used as is, in the standard feature representation. For the second feature set we applied a post-processing step to transform it into the relative feature representation: we calculated the average of each feature over each song, and represented each feature in each feature vector by that average together with the difference between the average and the actual feature value. This doubles the size of the feature vector to 300. Relative feature representation is based on the idea of expectation in music. We have previously shown [2] that this representation can lead to substantially better results, improving the correlation coefficient by over 10% on both axes.
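To make the transformation concrete, the following listing is a minimal sketch of the relative representation step described above. It is illustrative Python, not the code used in our experiments; the function name and the use of numpy are assumptions.

    import numpy as np

    def to_relative_representation(song_features):
        # Hypothetical helper, mirroring the description in Section 2.1.
        # song_features: (n_frames, n_features) per-frame descriptors for
        # one song (n_features = 150 in our setup).
        song_mean = song_features.mean(axis=0)    # per-song average of each feature
        diff = song_features - song_mean          # difference from the average
        tiled_mean = np.tile(song_mean, (song_features.shape[0], 1))
        # Each frame becomes [average, frame - average], doubling the width to 300.
        return np.hstack([tiled_mean, diff])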
2.2 CCNF
Our CCNF model is an undirected graphical model that can represent the conditional probability of a continuous valued vector y (for example, emotion on the valence axis) conditioned on continuous x (in this case, audio features).

In our discussion we use the following notation: x = {x_1, x_2, ..., x_n} is a set of observed input variables; X is a matrix whose ith column is x_i; y = {y_1, y_2, ..., y_n} is the set of output variables that we wish to predict, with x_i ∈ R^m and y_i ∈ R (the emotion label for frame i); and n is the length of the sequence of interest.

Our model for a particular set of observations is a conditional probability distribution with the probability density function

    P(y | x) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, dy}    (1)

We define two types of features in our model: vertex features f_k and edge features g_k. The potential function is defined as

    \Psi = \sum_{i} \sum_{k=1}^{K_1} \alpha_k f_k(y_i, x_i, \theta_k) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j)    (2)

We constrain \alpha_k > 0 and \beta_k > 0, while \Theta is unconstrained. The model parameters \alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_{K_1}\}, \Theta = \{\theta_1, \theta_2, \ldots, \theta_{K_1}\} and \beta = \{\beta_1, \beta_2, \ldots, \beta_{K_2}\} are learned and then used for inference during testing.

The vertex features f_k represent the mapping from x_i to y_i through a one-layer neural network, where \theta_k is the weight vector for a particular neuron k:

    f_k(y_i, x_i, \theta_k) = -(y_i - h(\theta_k, x_i))^2    (3)

    h(\theta, x_i) = \frac{1}{1 + e^{-\theta^T x_i}}    (4)

The number of vertex features K_1 is determined experimentally during cross-validation; in our experiments we tried K_1 = {5, 10, 20, 30}.

The edge features g_k represent the similarities between observations y_i and y_j. They are controlled by the neighbourhood measure S^{(k)}, which allows us to control the existence of such connections:

    g_k(y_i, y_j) = -\frac{1}{2} S^{(k)}_{i,j} (y_i - y_j)^2    (5)

In our linear-chain CCNF model, g_k enforces smoothness between neighbouring nodes. We define a single edge feature, i.e. K_2 = 1, and set S^{(1)}_{i,j} to 1 only when the nodes i and j are neighbours in the chain, and 0 otherwise.

2.2.1 Learning and Inference
We are given training data \{x^{(q)}, y^{(q)}\}_{q=1}^{M} of M song samples, together with their corresponding dimensional continuous emotion labels. The two dimensions are trained separately, but all the parameters (\alpha, \beta and \Theta) for each dimension are optimised jointly.

We convert Eq. (1) into multivariate Gaussian form, which simplifies both the derivation of the partial derivatives of the log-likelihood and the inference. For learning we use the constrained limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm to find locally optimal model parameters, using the standard Matlab implementation. To make the optimisation both more accurate and faster we use the partial derivatives of log P(y|x), which are straightforward to derive and are similar to those of CCRF [2].

A more thorough description of the model, as well as the code to reproduce the results, can be found at http://www.cl.cam.ac.uk/research/rainbow/projects/ccnf/
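To illustrate what the Gaussian form buys at test time, the following sketch performs linear-chain CCNF inference. It is an illustrative Python reconstruction from Eqs. (1)-(5), not the released Matlab code; the function name, array layout, and the assumption that the edge sum in Eq. (2) runs over ordered pairs (i, j) are ours. Because \Psi is quadratic in y, P(y|x) is a multivariate Gaussian, and the prediction is simply its mean.

    import numpy as np

    def ccnf_predict(X, alpha, beta, Theta):
        # Hypothetical helper for linear-chain CCNF inference.
        # X:     (n, m) matrix of audio feature vectors for one song.
        # alpha: (K1,) positive vertex-feature weights.
        # beta:  positive scalar edge-feature weight (K2 = 1).
        # Theta: (K1, m) neuron weight vectors.
        n = X.shape[0]
        H = 1.0 / (1.0 + np.exp(-(Theta @ X.T)))   # h(theta_k, x_i), Eq. (4); shape (K1, n)
        d = alpha @ H                              # d_i = sum_k alpha_k * h(theta_k, x_i)
        # Linear-chain neighbourhood S^(1): 1 for adjacent frames, 0 otherwise.
        S = np.eye(n, k=1) + np.eye(n, k=-1)
        L = np.diag(S.sum(axis=1)) - S             # graph Laplacian of the chain
        # Psi is quadratic in y, so P(y|x) is Gaussian with precision
        # Sigma^{-1} = 2 * (sum(alpha) * I + beta * L).
        precision = 2.0 * (alpha.sum() * np.eye(n) + beta * L)
        # The prediction is the Gaussian mean mu, which solves precision @ mu = 2 d.
        return np.linalg.solve(precision, 2.0 * d)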
3. RESULTS
To get a better understanding of where CCNF stands in terms of performance, we compared it to another standard approach used in the field. We used a Support Vector Regression (SVR) model with the Radial Basis Function kernel in the same way as CCNF: we trained a model for each axis, using 2-fold cross-validation to pick the best parameters for training. The experimental design was identical to that of our previous paper [3], which makes the results comparable not only to the baseline method in this challenge, but also across several datasets.

Table 1: Results for the SVR and CCNF models, using both the standard and the relative feature representation techniques (each cell shows the mean, followed by the range).

                        Arousal                            Valence
                 rho             RMSE              rho             RMSE
Baseline         0.18  +/-0.36   0.27  +/-0.12     0.11  +/-0.34   0.19  +/-0.11
Basic SVR        0.129 +/-0.325  0.146 +/-0.062    0.073 +/-0.267  0.100 +/-0.055
Basic CCNF       0.116 +/-0.632  0.139 +/-0.068    0.063 +/-0.593  0.102 +/-0.064
Relative SVR     0.148 +/-0.326  0.147 +/-0.064    0.074 +/-0.290  0.099 +/-0.062
Relative CCNF    0.181 +/-0.604  0.118 +/-0.069    0.066 +/-0.530  0.098 +/-0.062

Several interesting trends are visible in the results (see Table 1). First of all, CCNF combined with the relative feature representation clearly outperforms all the other methods on the arousal axis, as well as the baseline method. Secondly, the spread of the correlation coefficient for the CCNF model is twice as large as that for SVR, while there is little difference between the RMSE spreads of the different methods. In fact, on the valence axis there is little difference in performance between the different methods and the different representations.

4. FURTHER INSIGHTS
We found it interesting to compare the results achieved on this dataset with those achieved on the MoodSwings dataset, as the comparison shows how much impact the dataset has on the performance, and even on the ranking, of different methods. In our previous work CCNF clearly outperformed SVR with the standard feature representation, while the results with the relative feature representation were comparable between the two models. With this dataset we would have to draw very different conclusions: with the standard representation the results were comparable, if not better for SVR, while on the arousal axis there was a clear difference between the two when using the relative feature representation, with CCNF clearly outperforming SVR. This may be due to the fact that this dataset contains more training (and testing) samples and longer extracts, which are possibly better suited to the task.

The valence axis is still proving problematic. The fact that quite heavyweight techniques are unable to outperform simple models with small feature vectors seems to indicate that we are approaching the problem from the wrong angle. Improving results on the valence axis should be the top priority for our future work.

5. REFERENCES
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2014. In MediaEval Workshop, 2014.
[2] V. Imbrasaitė, T. Baltrušaitis, and P. Robinson. Emotion tracking in music using continuous conditional random fields and relative feature representation. In Proc. of ICME. IEEE, 2013.
[3] V. Imbrasaitė, T. Baltrušaitis, and P. Robinson. CCNF for continuous emotion tracking in music: comparison with CCRF and relative feature representation. In Proc. of ICME. IEEE, 2014.
[4] J. A. Speck, E. M. Schmidt, B. G. Morton, and Y. E. Kim. A comparative study of collaborative vs. traditional music mood annotation. In Proc. of ISMIR, 2011.