<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Music Emotion Tracking with Continuous Conditional Neural Fields and Relative Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vaiva Imbrasaitė</string-name>
          <email>Vaiva.Imbrasaite@cl.cam.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Robinson</string-name>
          <email>Peter.Robinson@cl.cam.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Laboratory, University of Cambridge</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This working notes paper introduces the system proposed by the Rainbow group for the MediaEval Emotion in Music 2014 task. The task is concerned with predicting dynamic emotion labels for an excerpt of a song. Our approach uses Continuous Conditional Neural Fields and relative feature representation, both of which have been developed or adapted by our group.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Emotion in Music task is concerned with providing
dynamic arousal and valence labels and is described in the
paper by Aljanaki et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The use of relative feature representation has already been
introduced to the eld of dynamic music annotation and
tested on MoodSwings dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by Imbrasaite_ et al.[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
They have shown substantial improvement over using
standard feature representation with the standard Support
Vector Regression (SVR) approach as well as comparable
performance to more complicated machine learning techniques
such as Continuous Conditional Random Fields.
      </p>
      <p>
        Continuous Conditional Neural Fields (CCNF) have also
been used for dynamic music annotation by Imbrasaite_ et
al.[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In our experiments we have achieved results that
clearly outperformed SVR when using standard feature
representation and produced similar results to using relative
feature representation. It was suspected that the short
extracts (only 15s) and little variation in emotion were the
main reasons why the model was not able to achieve better
results. In this paper we are applying the same techniques
to a dataset that improves on both accounts with a hope of
clearer results.
2.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature extraction and representation</title>
      <p>In our system we used two feature sets. Both feature
sets were extracted with OpenSMILE using a standard set of
features. As CCNF can suffer when dealing with a large
feature vector and fail to converge, we used a limited set of
statistical descriptors extracted from the features, limiting
the total number of features to 150.</p>
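      <p>As an illustration only (the descriptor set and window length
here are assumptions, not our exact OpenSMILE configuration), the
reduction from low-level frames to one fixed-length vector per
second might look as follows:</p>
      <preformat>
import numpy as np

def summarise(frames, descriptors=("mean", "std", "min", "max")):
    """Collapse a (n_frames x n_low_level) window into a single
    vector of statistical descriptors, keeping the feature count low."""
    stats = {
        "mean": frames.mean(axis=0),
        "std": frames.std(axis=0),
        "min": frames.min(axis=0),
        "max": frames.max(axis=0),
    }
    return np.concatenate([stats[d] for d in descriptors])

# e.g. one second of hypothetical low-level OpenSMILE frames
window = np.random.randn(100, 37)
x_i = summarise(window)  # one observation x_i, here 4 * 37 = 148 values
      </preformat>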
      <p>The first feature set was used as is, in the standard
feature representation; the second was transformed into the
relative feature representation of Imbrasaitė et al. [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
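      <p>A minimal sketch of the relative representation, assuming, as
in [<xref ref-type="bibr" rid="ref2">2</xref>], that each feature is
re-expressed relative to its average over the whole song (the exact
standardisation is described there):</p>
      <preformat>
import numpy as np

def relative_representation(song_features):
    """song_features: (n_seconds x n_features) matrix for one song.
    Each value is expressed relative to the song-wide average, so the
    model sees within-song change rather than absolute feature level."""
    song_mean = song_features.mean(axis=0, keepdims=True)
    return song_features - song_mean
      </preformat>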
    </sec>
    <sec id="sec-4">
      <title>2.2 CCNF</title>
      <p>Our CCNF model is an undirected graphical model
that can model the conditional probability of a continuous
valued vector y (for example, emotion in valence space)
depending on continuous x (in this case, audio features).</p>
      <p>In our discussion we will use the following notation: x =
\{x_1, x_2, \ldots, x_n\} is a set of observed input variables, X is a
matrix where the i-th column represents x_i,
y = \{y_1, y_2, \ldots, y_n\} is a set of output variables that we wish
to predict, x_i \in \mathbb{R}^m and y_i \in \mathbb{R} (the emotion label), and n
is the length of the sequence of interest.</p>
      <p>Our model for a particular set of observations is a
conditional probability distribution with the probability density
function:</p>
      <p>P(y|x) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, dy} \quad (1)</p>
      <p>We define two types of features in our model: vertex
features f_k and edge features g_k. Our potential function is
defined as:</p>
      <p>\Psi = \sum_{i} \sum_{k=1}^{K_1} \alpha_k f_k(y_i, x_i, \theta_k) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j) \quad (2)</p>
      <p>We constrain \alpha_k &gt; 0 and \beta_k &gt; 0, while \theta is unconstrained.
The model parameters \alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_{K_1}\},
\theta = \{\theta_1, \theta_2, \ldots, \theta_{K_1}\}, and
\beta = \{\beta_1, \beta_2, \ldots, \beta_{K_2}\} are learned
and used for inference during testing.</p>
      <p>The vertex features f_k represent the mapping from the x_i
to y_i through a one-layer neural network, where \theta_k is the
weight vector for a particular neuron k:</p>
      <p>f_k(y_i, x_i, \theta_k) = -(y_i - h(\theta_k, x_i))^2 \quad (3)</p>
      <p>h(\theta, x_i) = \frac{1}{1 + e^{-\theta^{\mathsf T} x_i}} \quad (4)</p>
      <p>The number of vertex features K_1 is determined
experimentally during cross-validation, and in our experiments we
tried K_1 \in \{5, 10, 20, 30\}.</p>
      <p>The edge features g_k represent the similarities between
observations y_i and y_j. This is also affected by the
neighborhood measure S^{(k)}, which allows us to control the existence
of such connections.</p>
      <p>g_k(y_i, y_j) = -\frac{1}{2} S^{(k)}_{i,j} (y_i - y_j)^2 \quad (5)</p>
      <p>In our linear chain CCNF model, g_k enforces smoothness
between neighboring nodes. We define a single edge feature,
i.e. K_2 = 1. We define S^{(1)} to be 1 only when the two nodes
i and j are neighbors in a chain, and 0 otherwise.</p>
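      <p>To make the construction concrete, the following is a hedged
numpy sketch of the potential \Psi for a linear chain (the variable
names are ours; alpha, beta and Theta come from training, and
neighbouring pairs are counted once here, which only rescales \beta):</p>
      <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def potential(y, X, alpha, beta, Theta):
    """Psi for a linear-chain CCNF.
    y: (n,) outputs; X: (m, n) inputs, column i is x_i;
    alpha: (K1,) vertex weights, positive; beta: (1,) edge weight,
    positive, since K2 = 1; Theta: (K1, m) neuron weight vectors."""
    H = sigmoid(Theta @ X)  # (K1, n), entry (k, i) is h(theta_k, x_i)
    # vertex features, Eq. (3): -(y_i - h(theta_k, x_i))^2
    vertex = -np.sum(alpha[:, None] * (y[None, :] - H) ** 2)
    # edge feature, Eq. (5), with S(1) linking chain neighbours only
    edge = -0.5 * beta[0] * np.sum((y[1:] - y[:-1]) ** 2)
    return vertex + edge
      </preformat>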
    </sec>
    <sec id="sec-4-1">
      <title>2.2.1 Learning and Inference</title>
      <p>We are given training data \{x^{(q)}, y^{(q)}\}_{q=1}^{M} of M song
samples, together with their corresponding dimensional
continuous emotion labels. The dimensions are trained separately,
but all the parameters (\alpha, \beta and \theta) for each dimension are
optimised jointly.</p>
      <p>We convert Eq. (1) into multivariate Gaussian form. This
helps with the derivation of the partial derivatives of the
log-likelihood, and with the inference.</p>
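      <p>A sketch of that Gaussian form, following the CCRF derivation
in [<xref ref-type="bibr" rid="ref2">2</xref>] (the exact construction
of the terms is given there):</p>
      <preformat>
P(y|x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}}
         \exp\Big( -\tfrac{1}{2} (y - \mu)^{\mathsf T} \Sigma^{-1} (y - \mu) \Big),
\qquad \Sigma^{-1} = 2(A + B), \qquad \mu = \Sigma d,
      </preformat>
      <p>where A is the diagonal contribution of the vertex features (entries
\sum_k \alpha_k), B is the contribution of the edge features, and
d_i = 2 \sum_k \alpha_k h(\theta_k, x_i) collects the neural network
predictions.</p>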
      <p>
        For learning we use the constrained limited-memory
Broyden-Fletcher-Goldfarb-Shanno algorithm to find
locally optimal model parameters, with the standard
Matlab implementation of the algorithm. In order to make the
optimisation both more accurate and faster we used the
partial derivatives of \log P(y|x), which are straightforward
to derive and are similar to those of CCRF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
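      <p>An equivalent, hedged sketch in Python (scipy's L-BFGS-B in
place of the Matlab routine, with box bounds standing in for the
positivity constraints on \alpha and \beta; the objective and
gradient functions here are placeholders):</p>
      <preformat>
import numpy as np
from scipy.optimize import minimize

def fit_ccnf(neg_log_lik, grad, n_alpha, n_beta, n_theta):
    """Minimise -log P(y|x) over the packed parameter vector
    [alpha, beta, theta]: alpha and beta are kept positive via
    bounds, theta is left unconstrained."""
    x0 = np.full(n_alpha + n_beta + n_theta, 0.1)
    eps = 1e-8
    bounds = ([(eps, None)] * (n_alpha + n_beta)  # alpha, beta positive
              + [(None, None)] * n_theta)         # theta unconstrained
    res = minimize(neg_log_lik, x0, jac=grad,
                   method="L-BFGS-B", bounds=bounds)
    return res.x
      </preformat>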
      <p>A more thorough description of the model as well as the
code to reproduce the results can be found at
http://www.cl.cam.ac.uk/research/rainbow/projects/ccnf/</p>
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS</title>
      <p>
        In order to get a better understanding of where CCNF
stands in terms of performance, we decided to compare it
to another standard approach used in the field. We used a
Support Vector Regression (SVR) model with the Radial
Basis Function kernel in the same way as CCNF: we
trained a model for each axis, using 2-fold cross-validation
to pick the best parameters for training. The experimental
design was identical to the one used in our previous paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
which makes the results comparable not only to the baseline
method in this challenge, but also across several datasets.
      </p>
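      <p>A hedged scikit-learn sketch of this baseline (the parameter
grid and scoring choice here are illustrative, not our exact
experimental settings):</p>
      <preformat>
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def train_svr_axis(X, y):
    """One RBF-kernel SVR per emotion axis, with 2-fold
    cross-validation to pick the training parameters."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}
    search = GridSearchCV(SVR(kernel="rbf"), grid, cv=2,
                          scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    return search.best_estimator_

# arousal and valence models are trained separately, as with CCNF
      </preformat>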
      <p>There are several interesting trends visible in the results
(see Table 1). First of all, CCNF combined with the
relative feature representation clearly outperforms all the other
methods for the arousal axis, as well as the baseline method.
Secondly, the spread of correlation for the CCNF model is twice
as big as that for SVR, while there is little difference
between the spreads of RMSE for the different methods. In
fact, there is little difference in performance between the
different methods and the different representations used for
the valence axis.</p>
    </sec>
    <sec id="sec-6">
      <title>4. FURTHER INSIGHTS</title>
      <p>We found it interesting to compare the results achieved
with this dataset to those achieved with the MoodSwings
dataset. This shows how much of an impact the dataset has
on the performance and even the ranking of different
methods. In our previous work CCNF clearly outperformed SVR
with the standard feature representation, while the results
with the relative feature representation were comparable
between the two models. With this dataset, we would have to
draw very different conclusions: with the standard
representation the results were comparable, if not better for SVR,
while there was a clear difference between the two when
using the relative feature representation for the arousal axis,
with CCNF clearly outperforming SVR. This may be due to
the fact that there are more training (and testing) samples
in this dataset, and that the extracts are longer and, possibly,
better suited to the task.</p>
      <p>The valence axis is still proving problematic. The fact that
quite heavyweight techniques are not able to outperform
simple models with small feature vectors seems to indicate
that we are approaching the problem from the wrong
angle. Improving results for the valence axis should be the
top priority for our future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music task at MediaEval 2014</article-title>
          . In MediaEval Workshop,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Imbrasaitė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baltrušaitis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Robinson</surname>
          </string-name>
          .
          <article-title>Emotion tracking in music using continuous conditional random fields and relative feature representation</article-title>
          .
          <source>Proc. of ICME, IEEE</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Imbrasaitė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baltrušaitis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Robinson</surname>
          </string-name>
          .
          <article-title>CCNF for continuous emotion tracking in music: Comparison with CCRF and relative feature representation</article-title>
          .
          <source>Proc. of ICME, IEEE</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Speck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Morton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>A comparative study of collaborative vs. traditional music mood annotation</article-title>
          .
          <source>Proc. of ISMIR</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>