Music Emotion Tracking with Continuous Conditional Neural Fields and Relative Representation

Vaiva Imbrasaitė (Vaiva.Imbrasaite@cl.cam.ac.uk)
Peter Robinson (Peter.Robinson@cl.cam.ac.uk)
Computer Laboratory, University of Cambridge, United Kingdom

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
This working notes paper introduces the system proposed by the Rainbow group for the MediaEval Emotion in Music 2014 task. The task is concerned with predicting dynamic emotion labels for an excerpt of a song. Our approach uses Continuous Conditional Neural Fields and relative feature representation, both of which have been developed or adapted by our group.

1. INTRODUCTION
The Emotion in Music task is concerned with providing dynamic arousal and valence labels and is described in the paper by Aljanaki et al. [1].

The use of relative feature representation has already been introduced to the field of dynamic music annotation and tested on the MoodSwings dataset [4] by Imbrasaitė et al. [2]. They showed a substantial improvement over standard feature representation with the standard Support Vector Regression (SVR) approach, as well as performance comparable to more complicated machine learning techniques such as Continuous Conditional Random Fields (CCRF).

Continuous Conditional Neural Fields (CCNF) have also been used for dynamic music annotation by Imbrasaitė et al. [3]. In those experiments, CCNF clearly outperformed SVR when using the standard feature representation, and produced similar results to SVR when using the relative feature representation. We suspected that the short extracts (only 15 s) and the small variation in emotion within them were the main reasons why the model was not able to achieve better results. In this paper we apply the same techniques to a dataset that improves on both accounts, in the hope of clearer results.

2. METHOD

2.1 Feature extraction and representation
Our system used two feature sets. Both were extracted by OpenSMILE using a standard set of features. As CCNF can suffer, and even fail to converge, when dealing with a large feature vector, we used a limited set of statistical descriptors extracted from the features, limiting the total number of features to 150.

The first feature set was used as is, in the standard feature representation. For the second feature set we applied a post-processing step to transform it into the relative feature representation: we calculated the average of each feature over each song, and represented each feature in each feature vector by that average together with the difference between the average and the actual feature value. This doubles the size of the feature vector to 300. Relative feature representation is based on the idea of expectation in music. We have previously shown [2] that this representation can lead to substantially better results, improving the correlation coefficient by over 10% on both axes.
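To make the transformation concrete, the following listing is a minimal sketch of the relative representation step described above. It is illustrative Python, not the code used in our experiments; the function name and the use of numpy are assumptions.

    import numpy as np

    def to_relative_representation(song_features):
        # Hypothetical helper, mirroring the description in Section 2.1.
        # song_features: (n_frames, n_features) per-frame descriptors for
        # one song (n_features = 150 in our setup).
        song_mean = song_features.mean(axis=0)    # per-song average of each feature
        diff = song_features - song_mean          # difference from the average
        tiled_mean = np.tile(song_mean, (song_features.shape[0], 1))
        # Each frame becomes [average, frame - average], doubling the width to 300.
        return np.hstack([tiled_mean, diff])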
2.2 CCNF
Our CCNF model is an undirected graphical model that can represent the conditional probability of a continuous valued vector y (for example, emotion on the valence axis) conditioned on continuous x (in this case, audio features).

In our discussion we use the following notation: x = {x_1, x_2, ..., x_n} is a set of observed input variables; X is a matrix whose ith column is x_i; y = {y_1, y_2, ..., y_n} is the set of output variables that we wish to predict, with x_i ∈ R^m and y_i ∈ R (the emotion label for frame i); and n is the length of the sequence of interest.

Our model for a particular set of observations is a conditional probability distribution with the probability density function

    P(y | x) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, dy}    (1)

We define two types of features in our model: vertex features f_k and edge features g_k. The potential function is defined as

    \Psi = \sum_{i} \sum_{k=1}^{K_1} \alpha_k f_k(y_i, x_i, \theta_k) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j)    (2)

We constrain \alpha_k > 0 and \beta_k > 0, while \Theta is unconstrained. The model parameters \alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_{K_1}\}, \Theta = \{\theta_1, \theta_2, \ldots, \theta_{K_1}\} and \beta = \{\beta_1, \beta_2, \ldots, \beta_{K_2}\} are learned and then used for inference during testing.

The vertex features f_k represent the mapping from x_i to y_i through a one-layer neural network, where \theta_k is the weight vector for a particular neuron k:

    f_k(y_i, x_i, \theta_k) = -(y_i - h(\theta_k, x_i))^2    (3)

    h(\theta, x_i) = \frac{1}{1 + e^{-\theta^T x_i}}    (4)

The number of vertex features K_1 is determined experimentally during cross-validation; in our experiments we tried K_1 = {5, 10, 20, 30}.

The edge features g_k represent the similarities between observations y_i and y_j. They are controlled by the neighbourhood measure S^{(k)}, which allows us to control the existence of such connections:

    g_k(y_i, y_j) = -\frac{1}{2} S^{(k)}_{i,j} (y_i - y_j)^2    (5)

In our linear-chain CCNF model, g_k enforces smoothness between neighbouring nodes. We define a single edge feature, i.e. K_2 = 1, and set S^{(1)}_{i,j} to 1 only when the nodes i and j are neighbours in the chain, and 0 otherwise.

2.2.1 Learning and Inference
We are given training data \{x^{(q)}, y^{(q)}\}_{q=1}^{M} of M song samples, together with their corresponding dimensional continuous emotion labels. The two dimensions are trained separately, but all the parameters (\alpha, \beta and \Theta) for each dimension are optimised jointly.

We convert Eq. (1) into multivariate Gaussian form, which simplifies both the derivation of the partial derivatives of the log-likelihood and the inference. For learning we use the constrained limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm to find locally optimal model parameters, using the standard Matlab implementation. To make the optimisation both more accurate and faster we use the partial derivatives of log P(y|x), which are straightforward to derive and are similar to those of CCRF [2].

A more thorough description of the model, as well as the code to reproduce the results, can be found at http://www.cl.cam.ac.uk/research/rainbow/projects/ccnf/
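To illustrate what the Gaussian form buys at test time, the following sketch performs linear-chain CCNF inference. It is an illustrative Python reconstruction from Eqs. (1)-(5), not the released Matlab code; the function name, array layout, and the assumption that the edge sum in Eq. (2) runs over ordered pairs (i, j) are ours. Because \Psi is quadratic in y, P(y|x) is a multivariate Gaussian, and the prediction is simply its mean.

    import numpy as np

    def ccnf_predict(X, alpha, beta, Theta):
        # Hypothetical helper for linear-chain CCNF inference.
        # X:     (n, m) matrix of audio feature vectors for one song.
        # alpha: (K1,) positive vertex-feature weights.
        # beta:  positive scalar edge-feature weight (K2 = 1).
        # Theta: (K1, m) neuron weight vectors.
        n = X.shape[0]
        H = 1.0 / (1.0 + np.exp(-(Theta @ X.T)))   # h(theta_k, x_i), Eq. (4); shape (K1, n)
        d = alpha @ H                              # d_i = sum_k alpha_k * h(theta_k, x_i)
        # Linear-chain neighbourhood S^(1): 1 for adjacent frames, 0 otherwise.
        S = np.eye(n, k=1) + np.eye(n, k=-1)
        L = np.diag(S.sum(axis=1)) - S             # graph Laplacian of the chain
        # Psi is quadratic in y, so P(y|x) is Gaussian with precision
        # Sigma^{-1} = 2 * (sum(alpha) * I + beta * L).
        precision = 2.0 * (alpha.sum() * np.eye(n) + beta * L)
        # The prediction is the Gaussian mean mu, which solves precision @ mu = 2 d.
        return np.linalg.solve(precision, 2.0 * d)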
3. RESULTS
To get a better understanding of where CCNF stands in terms of performance, we compared it to another standard approach used in the field. We used a Support Vector Regression (SVR) model with the Radial Basis Function kernel in the same way as CCNF: we trained a model for each axis, using 2-fold cross-validation to pick the best parameters for training. The experimental design was identical to that of our previous paper [3], which makes the results comparable not only to the baseline method in this challenge, but also across several datasets.

Table 1: Results for the SVR and CCNF models, using both the standard and the relative feature representation techniques (each cell shows the mean, followed by the range).

                        Arousal                            Valence
                 rho             RMSE              rho             RMSE
Baseline         0.18  +/-0.36   0.27  +/-0.12     0.11  +/-0.34   0.19  +/-0.11
Basic SVR        0.129 +/-0.325  0.146 +/-0.062    0.073 +/-0.267  0.100 +/-0.055
Basic CCNF       0.116 +/-0.632  0.139 +/-0.068    0.063 +/-0.593  0.102 +/-0.064
Relative SVR     0.148 +/-0.326  0.147 +/-0.064    0.074 +/-0.290  0.099 +/-0.062
Relative CCNF    0.181 +/-0.604  0.118 +/-0.069    0.066 +/-0.530  0.098 +/-0.062

Several interesting trends are visible in the results (see Table 1). First of all, CCNF combined with the relative feature representation clearly outperforms all the other methods on the arousal axis, as well as the baseline method. Secondly, the spread of the correlation coefficient for the CCNF model is twice as large as that for SVR, while there is little difference between the RMSE spreads of the different methods. In fact, on the valence axis there is little difference in performance between the different methods and the different representations.

4. FURTHER INSIGHTS
We found it interesting to compare the results achieved on this dataset with those achieved on the MoodSwings dataset, as the comparison shows how much impact the dataset has on the performance, and even on the ranking, of different methods. In our previous work CCNF clearly outperformed SVR with the standard feature representation, while the results with the relative feature representation were comparable between the two models. With this dataset we would have to draw very different conclusions: with the standard representation the results were comparable, if not better for SVR, while on the arousal axis there was a clear difference between the two when using the relative feature representation, with CCNF clearly outperforming SVR. This may be due to the fact that this dataset contains more training (and testing) samples and longer extracts, which are possibly better suited to the task.

The valence axis is still proving problematic. The fact that quite heavyweight techniques are unable to outperform simple models with small feature vectors seems to indicate that we are approaching the problem from the wrong angle. Improving results on the valence axis should be the top priority for our future work.

5. REFERENCES
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2014. In MediaEval Workshop, 2014.
[2] V. Imbrasaitė, T. Baltrušaitis, and P. Robinson. Emotion tracking in music using continuous conditional random fields and relative feature representation. In Proc. of ICME. IEEE, 2013.
[3] V. Imbrasaitė, T. Baltrušaitis, and P. Robinson. CCNF for continuous emotion tracking in music: comparison with CCRF and relative feature representation. In Proc. of ICME. IEEE, 2014.
[4] J. A. Speck, E. M. Schmidt, B. G. Morton, and Y. E. Kim. A comparative study of collaborative vs. traditional music mood annotation. In Proc. of ISMIR, 2011.