<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Music Emotion Recognition Using State-Space Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantin Markov</string-name>
          <email>markov@u-aizu.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomoko Matsui</string-name>
          <email>tmatsui@ism.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Statistical Modeling, Institute of Statistical Mathematics</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Human Interface Laboratory, The University of Aizu</institution>
          ,
          <addr-line>Fukushima</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2014 benchmark evaluation campaign. The arousal-valence trajectory prediction is cast as a time series filtering task and is modeled by state-space models. These models include a standard linear model (Kalman filter) as well as a novel non-linear, non-parametric Gaussian Process based dynamic system. The music signal was parametrized using standard features extracted with the Marsyas toolkit. Based on the preliminary results obtained from a small random validation set, no clear advantage of any feature or model could be observed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Gaussian Processes (GPs) [
        <xref ref-type="bibr" rid="ref6">4</xref>
        ] are becoming increasingly
popular in the Machine Learning community for their ability
to learn highly non-linear mappings between two continuous
data spaces. Previously, we have successfully applied GPs
for static music emotion recognition [
        <xref ref-type="bibr" rid="ref5">3</xref>
        ]. Dynamic or
continuous emotion estimation is a more difficult task, and there are
several approaches to solving it. The simplest is to assume
that emotion is constant over a relatively short period of time
and apply static emotion recognition methods. A better
approach is to consider the emotion trajectory as a time-varying
process and try to track it, or to use time series modeling
techniques. In [
        <xref ref-type="bibr" rid="ref7">5</xref>
        ], the authors use Kalman filters to model emotion
evolution in time for each of four data partitions. For
evaluation, the KL divergence between the predicted and reference
A-V point distributions is measured assuming "perfect" test
sample partitioning. Our approach is similar since we also
use data partitioning; however, we apply a model selection
method. In addition, we present a novel dynamic music
emotion model based on GPs. The task and the database used
in this evaluation are described in detail in the Emotion in
Music overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. STATE-SPACE MODELS</title>
      <p>State-space models (SSMs) are widely used in time series
analysis, prediction, and modeling. They consist of a latent
state variable x_t ∈ R^e and an observable measurement variable
y_t ∈ R^d, which are related as follows:

  x_t = f(x_{t-1}) + v_{t-1}   (1)
  y_t = g(x_t) + w_t           (2)</p>
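      <p>As a concrete illustration, Eqs. (1)-(2) can be simulated directly. The following Python sketch uses hypothetical dimensions, placeholder noise levels, and a simple linear choice of f and g (none of these values are the ones used in our systems):

```python
import numpy as np

rng = np.random.default_rng(0)

e, d, T = 2, 4, 100                  # state dim, observation dim, length
A = 0.9 * np.eye(e)                  # placeholder linear dynamics f(x) = Ax
B = rng.standard_normal((d, e))      # placeholder linear observation g(x) = Bx
q, r = 0.1, 0.05                     # process / measurement noise std devs

x = np.zeros((T, e))
y = np.zeros((T, d))
x[0] = rng.standard_normal(e)
y[0] = B @ x[0] + r * rng.standard_normal(d)
for t in range(1, T):
    x[t] = A @ x[t - 1] + q * rng.standard_normal(e)   # Eq. (1)
    y[t] = B @ x[t] + r * rng.standard_normal(d)       # Eq. (2)
```

The filtering task is then to recover x_t from the observed y_1, ..., y_t.
      </p>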
    </sec>
    <sec id="sec-3">
      <title>2.1 Kalman filter</title>
      <p>The Kalman filter is a linear SSM where f(x) = Ax
and g(x) = Bx, with A and B being unknown
parameters, and v and w are zero-mean Gaussian noises. Thus,
both p(x_t | x_{t-1}) and p(y_t | x_t) become Gaussians, and a simple
analytic solution for the filtering and smoothing tasks can
be obtained.</p>
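      <p>The analytic filtering recursion for this linear-Gaussian case can be sketched as follows; this is a minimal textbook implementation in which A, B and the noise covariances Q, R are assumed given, whereas in our system they are estimated from training data:

```python
import numpy as np

def kalman_filter(y, A, B, Q, R, mu0, P0):
    """Filtering for x_t = A x_{t-1} + v_t, y_t = B x_t + w_t,
    with v ~ N(0, Q) and w ~ N(0, R)."""
    mu, P = mu0, P0
    means = []
    for yt in y:
        # Predict: propagate the state estimate through the linear dynamics.
        mu = A @ mu
        P = A @ P @ A.T + Q
        # Update: correct the prediction with the new observation.
        S = B @ P @ B.T + R                # innovation covariance
        K = P @ B.T @ np.linalg.inv(S)     # Kalman gain
        mu = mu + K @ (yt - B @ mu)
        P = P - K @ B @ P
        means.append(mu)
    return np.array(means)
```

A smoothed (RTS) estimate adds an analogous backward pass over the stored filtered moments.
      </p>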
    </sec>
    <sec id="sec-4">
      <title>2.2 Gaussian Process dynamic system</title>
      <p>
        When f(·) and g(·) are modeled by GPs, we get a Gaussian
Process dynamic system. Such SSMs have been proposed
recently, but they lack efficient and commonly adopted algorithms
for learning and inference. The availability of A-V values for
training, however, makes the learning task easy, since each
target dimension of f(·) and g(·) can be learned independently
using the GP regression training algorithm. For inference,
however, there is no straightforward solution. One can
always opt for Monte Carlo sampling algorithms, but they are
notoriously slow. We used the solution proposed in [
        <xref ref-type="bibr" rid="ref4">2</xref>
        ]. It
is based on analytic moment matching to derive Gaussian
approximations to the filtering and smoothing distributions.
      </p>
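      <p>For the learning step, each output dimension reduces to standard GP regression. A minimal sketch with an RBF kernel (the hyperparameter values here are illustrative, not the trained ones):

```python
import numpy as np

def gp_regress(X, y, Xs, ell=0.2, sf=1.0, noise=0.05):
    """GP regression posterior mean for one output dimension (RBF kernel)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return sf**2 * np.exp(-0.5 * d2 / ell**2)
    K = k(X, X) + noise**2 * np.eye(len(X))   # noisy kernel matrix
    return k(Xs, X) @ np.linalg.solve(K, y)   # posterior mean at test inputs

# Each target dimension of the transition f: x_{t-1} -> x_t (and likewise of
# the observation map g) would be fit separately, e.g.
#   gp_regress(X_prev, X_next[:, dim], X_query)
```
      </p>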
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS</title>
      <p>The development dataset was randomly split into training
and validation sets of 600 and 144 clips, respectively. A full
cross-validation scenario was not adopted due to time constraints.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Feature extraction</title>
      <p>
        Features were extracted from the audio signal, which was
first downsampled to 22050 Hz. Using the Marsyas toolkit,
we obtained features such as mfcc; spfe, including zero-crossing
rate, spectral flux, centroid, and rolloff; and the spectral crest
factor scf. All feature vectors were calculated from 512-sample
frames with no overlap. First-order statistics were
calculated over windows of 1 sec. with 0.5 sec. overlap. Thus,
for the last 30 seconds of each clip there were 61 feature
vectors. In addition to these features, we also used the features
from the MediaEval 2014 baseline system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
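      <p>The aggregation of frame-level features into 1 sec. window statistics can be sketched as follows; only the mean is computed here, and the exact set of first-order statistics is an assumption:

```python
import numpy as np

def window_stats(frames, sr=22050, frame_len=512):
    """Average frame-level features over 1 sec. windows with a 0.5 sec. hop.

    frames: array of shape (n_frames, n_dims), one row per 512-sample frame.
    """
    per_win = sr // frame_len        # frames per 1 sec. window (~43)
    hop = per_win // 2               # 0.5 sec. hop
    out = [frames[s:s + per_win].mean(axis=0)
           for s in range(0, len(frames) - per_win + 1, hop)]
    return np.array(out)
```
      </p>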
    </sec>
    <sec id="sec-7">
      <title>3.2 Data clustering</title>
      <p>
        In a way similar to [
        <xref ref-type="bibr" rid="ref7">5</xref>
        ], we clustered all training clips into
four clusters based on their static A-V values. Separate
SSMs were trained on each cluster's data. During
testing, the trajectory obtained from the model that showed
the best match, i.e. the highest likelihood, was taken as the
prediction result.
      </p>
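      <p>The clustering and model selection steps can be sketched as follows; a plain k-means on the static A-V values is assumed here purely for illustration (the specific clustering algorithm is not stated above):

```python
import numpy as np

def kmeans(av, k=4, iters=50, seed=0):
    """Cluster clips by their static (arousal, valence) values into k clusters."""
    rng = np.random.default_rng(seed)
    centers = av[rng.choice(len(av), size=k, replace=False)]
    for _ in range(iters):
        # Assign each clip to its nearest centroid.
        labels = ((av[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Recompute centroids, keeping the old one if a cluster goes empty.
        centers = np.array([av[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def pick_model(logliks):
    """Test-time model selection: the cluster model with the highest likelihood."""
    return int(np.argmax(logliks))
```
      </p>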
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
      <p>In order to see the effect of data clustering, we also
evaluated a linear system trained on all 600 clips. Tables 1 and 2
show the average correlation coefficient as well as the
average RMS error with respect to different features for Arousal
and Valence, respectively. As can be seen, the clustered
multiple models show lower correlation but smaller RMSE. It
is possible that the clustering reduced the amount of
training data for each model, resulting in less accurate parameter
estimation. Table 3 shows the results of the GP based system
evaluation with multiple models. A single model was not used
due to prohibitive memory requirements. Compared to the
corresponding multiple-model results of the linear system,
only Valence shows some improvement.</p>
      <p>Using the official test set consisting of 1000 clips, we were
able to evaluate only the Kalman filter based system due to
time limitations. Results using the baseline features as well
as a couple of Marsyas feature sets are presented in Table 4.</p>
    </sec>
    <sec id="sec-9">
      <title>5. CONCLUSIONS</title>
      <p>We presented two state-space model based dynamic music
emotion recognition systems - one linear and one based on
Gaussian Processes. The preliminary results did not show a
clear advantage of any system or feature set. This is
probably due to the small size of the validation set. More detailed
experiments involving more data are planned for the future.</p>
      <p>[Table 4, Kalman filter results on the official test set; values recovered from a flattened layout, column alignment uncertain. Rows: mfcc, spfe, baseline. Corr.Coef. AROUSAL: 0.2735 0.4522, 0.1622 0.5754, 0.2063 0.5720. VALENCE: 0.0469 0.4326, 0.0265 0.4378, 0.1665 0.5166, 0.0743, 0.0714, 0.0393.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in music task at MediaEval 2014</article-title>
          . In MediaEval 2014 Workshop, Barcelona, Spain, Oct
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <!-- Tables 2 and 3 were misparsed into the reference list; recovered content:
           Table 2: Kalman filter and linear RTS smoother VALENCE results, 144-clip
           validation set. Columns: KF Corr.Coef./RMSE, RTS Corr.Coef./RMSE.
           Single model: mfcc 0.0411/0.6262, 0.0598/0.7082; spfe 0.0332/0.3945,
           0.0464/0.4710; mfcc+spfe 0.0304/0.6208, 0.0725/0.6978; mfcc+scf
           0.1545/0.6692, 0.1401/0.7231; baseline 0.0753/0.2681, 0.0779/0.2996.
           Multiple models: mfcc -0.082/0.1847, -0.042/0.1915; spfe -0.055/0.2353,
           -0.060/0.2497; mfcc+spfe -0.054/0.1866, -0.068/0.1914; mfcc+scf
           0.0149/0.1688, -0.008/0.1703; baseline -0.080/0.2425, -0.058/0.2497.
           Table 3: GP filter and GP-RTS smoother results, multiple models, 144-clip
           validation set. Columns: GP-F Corr.Coef., GP-RTS Corr.Coef., RMSE. -->
      <ref id="ref4">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Deisenroth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Hanebeck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          .
          <article-title>Robust filtering and smoothing with gaussian processes</article-title>
          .
          <source>Automatic Control</source>
          , IEEE Transactions on,
          <volume>57</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1865</fpage>
          -
          <lpage>1871</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsui</surname>
          </string-name>
          .
          <article-title>Music emotion recognition using gaussian processes</article-title>
          .
          <source>In Proceedings of the ACM multimedia 2013 workshop on Crowdsourcing for Multimedia</source>
          ,
          <source>CrowdMM. ACM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Gaussian Processes for Machine Learning</article-title>
          .
          <source>Adaptive Computation and Machine Learning</source>
          . The MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Prediction of time-varying musical mood distributions using kalman filtering</article-title>
          .
          <source>In Machine Learning and Applications (ICMLA)</source>
          ,
          <year>2010</year>
          Ninth International Conference on, pages
          <fpage>655</fpage>
          -
          <lpage>660</lpage>
          ,
          <year>Dec 2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>