<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Music Emotion Recognition Using Kernel Bayes' Filter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantin Markov</string-name>
          <email>markov@u-aizu.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomoko Matsui</string-name>
          <email>tmatsui@ism.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Statistical Modeling, Institute of Statistical Mathematics</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Human Interface Laboratory, The University of Aizu</institution>
          ,
          <addr-line>Fukushima</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2015 benchmark evaluation campaign. The arousal-valence trajectory prediction is cast as a time-series filtering task and is performed using state-space models. A simple and widely used example is the Kalman Filter; however, it is a linear parametric model and has serious limitations. Non-linear and non-parametric approaches, on the other hand, do not have such drawbacks, but often scale poorly with the amount and dimensionality of the training data. One such recently proposed method is the Kernel Bayes' Filter (KBF). It operates only on data Gram matrices and thus works (almost) equally well with data of both low and high dimension. In our experiments, we used the feature set provided by the organizers without any change. All the development data were grouped into six clusters based on the genre information available from the meta-data. For performance comparison, we built three more emotion recognition systems based on standard Multivariate Linear Regression (MLR), Support Vector Regression (SVR), and the Kalman Filter (KF). The results obtained from a 4-fold cross-validation on the development set show that all types of models, except the KF, achieved very similar performance, which suggests that they may have reached the upper bound of the feature set's discrimination power.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Dynamic or continuous emotion estimation is a more difficult task, and there are several approaches to solve it. The simplest one is to assume that emotion is constant over a relatively short period of time and to apply static emotion recognition methods. These include conventional regression methods as well as a combination of classification and regression, where data are clustered in advance and a separate regression model is built for each cluster. Testing then involves an initial classification step or a model selection procedure. A better approach is to consider the emotion trajectory as a time-varying process and try to track it, or to use time series modelling techniques involving state-space models (SSM). A popular and simple SSM is the Kalman filter (KF). It is a linear system and is quite fast, since it requires just matrix multiplications and its complexity is linear in the number of data points. However, the linearity assumption is a serious drawback of the KF.</p>
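      <p>As a concrete illustration of the linear SSM mentioned above, a minimal Kalman filter step can be sketched in NumPy as follows. This is a generic textbook predict/update formulation under hypothetical model matrices A, C, Q, R, not the exact configuration of the KF system evaluated in this paper:</p>

```python
import numpy as np

def kalman_step(m, P, z, A, C, Q, R):
    """One Kalman filter iteration: predict the state, then correct it
    with the new observation z. All operations are matrix products."""
    # Predict: propagate mean and covariance through the linear dynamics A
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update: Kalman gain and correction via the observation model C
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    m_new = m_pred + K @ (z - C @ m_pred)
    P_new = (np.eye(len(m)) - K @ C) @ P_pred
    return m_new, P_new
```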
    </sec>
    <sec id="sec-2">
      <title>2. KERNEL BAYES’ FILTER</title>
      <p>
        Details about the Kernel Bayes' Filter can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Here we provide just the basic notation and the final update rules. During KBF training, ground truth values of both the observations X = {x_1, ..., x_T} and the corresponding state values Y = {y_1, ..., y_T} are required. The prediction and conditioning steps of the standard filtering algorithms can be reformulated with respect to the kernel embeddings. The embedding of the predictive distribution p(x_t | y_{1:t}) is denoted μ_{x_t|y_{1:t}} and is estimated as Σ_{i=1}^{T} α_i φ(x_i), where φ(·) is the feature map and α_t is updated recursively using

        D_{t+1} = diag((G + εI)^{-1} G̃ α_t),
        α_{t+1} = D_{t+1} K ((D_{t+1} K)^2 + δI)^{-1} D_{t+1} K_{:x_{t+1}}.   (1)
      </p>
      <p>Here, G and K are the Gram matrices of the training states and observations, respectively, G̃ is a shifted Gram matrix with entries G̃_{ij} = k(x_i, x_{j+1}), and K_{:x_{t+1}} = (k(x_1, x_{t+1}), ..., k(x_T, x_{t+1})). The regularization parameters ε and δ are needed to avoid numerical problems during matrix inversion.</p>
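      <p>For illustration, the recursion in Eq. (1) can be sketched in NumPy as follows; the variable names (G_shift, k_new) and the default values of the regularization parameters are illustrative assumptions, not the paper's actual settings:</p>

```python
import numpy as np

def kbf_update(alpha, G, G_shift, K, k_new, eps=1e-3, delta=1e-3):
    """One Kernel Bayes' Filter weight update, following Eq. (1).
    alpha   : (T,)  current embedding weights
    G       : (T,T) Gram matrix of the training states
    G_shift : (T,T) shifted Gram matrix, G_shift[i, j] = k(x_i, x_{j+1})
    K       : (T,T) Gram matrix of the training observations
    k_new   : (T,)  kernel vector (k(x_1, x_{t+1}), ..., k(x_T, x_{t+1}))
    """
    T = G.shape[0]
    # D_{t+1} = diag((G + eps*I)^{-1} G_shift alpha_t)
    D = np.diag(np.linalg.solve(G + eps * np.eye(T), G_shift @ alpha))
    # alpha_{t+1} = D K ((D K)^2 + delta*I)^{-1} D k_new
    DK = D @ K
    return DK @ np.linalg.solve(DK @ DK + delta * np.eye(T), D @ k_new)
```

Note that only Gram matrices enter the recursion, which is why the cost does not depend on the dimensionality of the raw features.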
      <p>There are a few kernel functions that can be used with the KBF, such as the linear, RBF, and polynomial kernels. Their parameters, as well as the regularization constants ε and δ, comprise the set of hyper-parameters of a KBF system. Unfortunately, there is no algorithm for learning those hyper-parameters from data. They have to be set manually and, as our experiments showed, are critical for obtaining good performance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS</title>
      <p>Using the genre information available from the meta-data, we divided all development clips into six clusters roughly corresponding to the following genres: Classical, Electronic, Jazz-Blues, Rock-Pop, International-Folk, and HipHop-SoulRB. The number of clusters was chosen such that the data distribution becomes as uniform as possible.</p>
      <p>In order to visualize the relationship between the clustered clips and their emotional content, we calculated arousal and valence (AV) statistics per clip, and Figure 1 shows the distribution of the mean AV vectors in the affect space. Different colors represent different genres/clusters, and the circle size is proportional to the AV standard deviation. As can be seen, there is no clear grouping by genre, though some genres show more compact clouds than others. Both filtering systems, i.e. the KF and KBF, were built using this clustering scheme, where one model was trained for each genre and tested with the test data from the same genre only. The linear regression and SVR based systems were trained with no regard to the genre clusters.</p>
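      <p>The genre-dependent training and testing scheme can be sketched as follows; the train_filter helper and the model call signature are hypothetical placeholders for a KF or KBF trainer:</p>

```python
# Sketch of genre-dependent filtering: one model per genre cluster,
# selected at test time by the clip's genre tag.
def train_per_genre(clips, train_filter):
    """clips: list of (genre, features, av_trajectory) tuples."""
    models = {}
    for genre in {c[0] for c in clips}:
        subset = [(x, y) for g, x, y in clips if g == genre]
        models[genre] = train_filter(subset)   # e.g. a KF or KBF
    return models

def predict(models, genre, features):
    # Test clips are routed to the model of their own genre only
    return models[genre](features)
```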
      <p>Since there is no validation data set available, we used a 4-fold cross-validation approach to tune the systems' parameters. The SVR and KBF models have hyper-parameters, such as the kernel function parameters and the regularization constants, which cannot be learned from data. An unconstrained simplex search method was adopted to find the optimum parameter setting; however, it does not guarantee a global optimum and, in the case of the KBF, it turned out that the initial point has a big impact on the final result.</p>
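      <p>Such a simplex search can be reproduced with SciPy's Nelder-Mead implementation; in this hedged sketch the objective is a stand-in quadratic over hypothetical log-scale hyper-parameters, not the actual cross-validation loss:</p>

```python
import numpy as np
from scipy.optimize import minimize

def cv_objective(log_params):
    """Stand-in for the 4-fold cross-validation loss as a function of
    log-scale hyper-parameters (e.g. kernel width, eps, delta).
    The quadratic form and its optimum are purely illustrative."""
    target = np.array([-2.0, -3.0, -3.0])  # hypothetical optimum
    return float(np.sum((log_params - target) ** 2))

# Nelder-Mead simplex search; on a multi-modal objective the result
# depends on the starting point x0, as observed with the KBF.
res = minimize(cv_objective, x0=np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6})
best_log_params = res.x
```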
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS</title>
      <p>Before the calculation of the correlation and RMSE performance measures, the predicted arousal and valence values, as well as the reference values, were scaled to fit the range [-0.5, +0.5]. This is similar to the way results were obtained during previous evaluations.</p>
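      <p>A minimal sketch of the rescaling and the two performance measures, assuming a simple per-trajectory min-max normalization (the exact normalization used in the evaluations may differ):</p>

```python
import numpy as np

def scale_to_half_range(v):
    """Linearly rescale a trajectory to fit the range [-0.5, +0.5]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) - 0.5

def rmse(pred, ref):
    """Root mean squared error between two trajectories."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def corr(pred, ref):
    """Pearson correlation between two trajectories."""
    return float(np.corrcoef(pred, ref)[0, 1])
```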
      <p>Table 1 shows the performance of the KBF for each genre as well as the total average. For some genres, the results are better, which may be due to differences in the data distributions, but also because of better hyper-parameter settings. The total averages from all the regression and state-space model based systems are summarised in Table 2.</p>
      <p>The results using the official test data set are shown in</p>
    </sec>
    <sec id="sec-conclusion">
      <title>5. CONCLUSION</title>
      <p>We described several systems developed at the University of Aizu for the MediaEval 2015 Emotion in Music evaluation campaign. Our focus was on the machine learning part of this very challenging task and, thus, we built and evaluated a few systems based on conventional regression techniques, as well as on a new non-parametric, non-linear approach using the Kernel Bayes' Filter state-space system. All of them used the feature set provided by the challenge organizers. Although the modelling techniques we utilized range from simple linear regression to a sophisticated state-space Bayesian filter, there was a negligible difference in performance. This suggests that the feature set may not have enough discriminating power to enable non-parametric, non-linear models to show their advantages.</p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGEMENT</title>
      <p>The authors would like to thank Dr. Y. Nishiyama from the University of Electro-Communications, Tokyo, for sharing his Matlab kernel mean toolbox (kmtb).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music task at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukumizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          .
          <article-title>Kernel Bayes' rule: Bayesian inference with positive definite kernels</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3753</fpage>
          -
          <lpage>3783</lpage>
          , Dec.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Septier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Peters</surname>
          </string-name>
          .
          <article-title>Dynamic speech emotion recognition with state-space models</article-title>
          .
          <source>In Proc. EUSIPCO'2015</source>
          , pages
          <fpage>2122</fpage>
          -
          <lpage>2126</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukumizu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          .
          <article-title>Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models</article-title>
          .
          <source>Signal Processing Magazine</source>
          , IEEE,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>98</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>