=Paper=
{{Paper
|id=Vol-1263/paper28
|storemode=property
|title=Dynamic Music Emotion Recognition Using State-Space Models
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_28.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MarkovM14
}}
==Dynamic Music Emotion Recognition Using State-Space Models==
Konstantin Markov, Human Interface Laboratory, The University of Aizu, Fukushima, Japan (markov@u-aizu.ac.jp); Tomoko Matsui, Department of Statistical Modeling, Institute of Statistical Mathematics, Tokyo, Japan (tmatsui@ism.ac.jp)

===Abstract===
This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2014 benchmark evaluation campaign. Arousal-valence (A-V) trajectory prediction is cast as a time series filtering task and is modeled with state-space models. These include the standard linear model (the Kalman filter) as well as a novel non-linear, non-parametric dynamic system based on Gaussian Processes. The music signal was parametrized using standard features extracted with the Marsyas toolkit. Based on the preliminary results obtained from a small random validation set, no clear advantage of any feature or model could be observed.

===1. Introduction===
Gaussian Processes (GPs) [4] are becoming increasingly popular in the machine learning community for their ability to learn highly non-linear mappings between two continuous data spaces. Previously, we successfully applied GPs to static music emotion recognition [3]. Dynamic or continuous emotion estimation is a more difficult task, and there are several approaches to it. The simplest is to assume that emotion is constant over a relatively short period of time and to apply static emotion recognition methods. A better approach is to treat the emotion trajectory as a time-varying process and to track it using time series modeling techniques. In [5], the authors use Kalman filters to model the evolution of emotion in time for each of four data partitions. For evaluation, the KL divergence between the predicted and reference A-V point distributions is measured, assuming "perfect" partitioning of the test samples. Our approach is similar in that we also use data partitioning; however, we apply a model selection method. In addition, we present a novel dynamic music emotion model based on GPs. The task and the database used in this evaluation are described in detail in the Emotion in Music overview paper [1].

===2. State-Space Models===
State-space models (SSMs) are widely used in time series analysis, prediction, and modeling. They consist of a latent state variable x_t ∈ R^e and an observable measurement variable y_t ∈ R^d, which are related as follows:

: x_t = f(x_{t-1}) + v_{t-1}    (1)
: y_t = g(x_t) + w_t    (2)

where f() and g() are unknown functions governing the temporal state dynamics and the state-to-measurement mapping, respectively. The system and observation noises v_t and w_t are assumed to be independent. Probabilistically, an SSM can also be defined by two distributions: p(x_t | x_{t-1}) and p(y_t | x_t). For a sequence of T measurements, the filtering task is to approximate p(x_t | y_{1:t}), while approximating p(x_t | y_{1:T}) is the goal of Rauch-Tung-Striebel (RTS) smoothing.

For continuous music emotion recognition, x_t represents the unknown A-V vector and y_t corresponds to the feature vector(s). SSM learning in our case is simplified, since the state A-V labels are given for the training data and f() and g() can therefore be learned independently.

====2.1 Kalman filter====
The Kalman filter is a linear SSM where f(x) = Ax and g(x) = Bx, with A and B being unknown parameters, and v and w are zero-mean Gaussian noises. Thus, both p(x_t | x_{t-1}) and p(y_t | x_t) become Gaussian, and a simple analytic solution for the filtering and smoothing tasks can be obtained. A minimal sketch of this learning and filtering procedure is given below.
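The sketch below is our illustration, not the authors' implementation: because the A-V state sequence is observed at training time, A and B can be fitted by ordinary least squares before running the textbook Kalman recursion. The helper names and array shapes (`X` for the T×e state trajectory, `Y` for the T×d feature sequence) are assumptions made for the example.

<pre>
import numpy as np

def fit_linear_ssm(X, Y):
    """Least-squares estimates of A, B and the noise covariances Q, R,
    given observed states X (T x e) and measurements Y (T x d)."""
    # Transition x_t = A x_{t-1} + v: regress X[1:] on X[:-1]
    A = np.linalg.lstsq(X[:-1], X[1:], rcond=None)[0].T
    Q = np.atleast_2d(np.cov((X[1:] - X[:-1] @ A.T).T))
    # Measurement y_t = B x_t + w: regress Y on X
    B = np.linalg.lstsq(X, Y, rcond=None)[0].T
    R = np.atleast_2d(np.cov((Y - X @ B.T).T))
    return A, B, Q, R

def kalman_filter(Y, A, B, Q, R, m0, P0):
    """Standard Kalman filter; returns the filtered state means E[x_t | y_1:t]."""
    m, P, means = m0, P0, []
    for y in Y:
        m, P = A @ m, A @ P @ A.T + Q              # predict
        S = B @ P @ B.T + R                        # innovation covariance
        K = P @ B.T @ np.linalg.inv(S)             # Kalman gain
        m, P = m + K @ (y - B @ m), P - K @ B @ P  # update
        means.append(m)
    return np.array(means)
</pre>

For a 2-D A-V state one might start the recursion from, e.g., m0 = np.zeros(2) and P0 = np.eye(2); an RTS smoother would add a backward pass over the stored predictive means and covariances.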
====2.2 Gaussian Process dynamic system====
When f() and g() are modeled by GPs, we obtain a Gaussian Process dynamic system. Such SSMs have been proposed recently, but they lack efficient and commonly adopted algorithms for learning and inference. The availability of A-V values for training, however, makes the learning task easy, since each target dimension of f() and g() can be learned independently using the GP regression training algorithm. For inference, however, there is no straightforward solution. One can always opt for Monte Carlo sampling algorithms, but they are notoriously slow. We used the solution proposed in [2], which is based on analytic moment matching to derive Gaussian approximations to the filtering and smoothing distributions.

===3. Experiments===
The development dataset was randomly split into training and validation sets of 600 and 144 clips, respectively. A full cross-validation scenario was not adopted due to time constraints.

====3.1 Feature extraction====
Features were extracted from the audio signal, which was first downsampled to 22050 Hz. Using the Marsyas toolkit we obtained mel-frequency cepstral coefficients (mfcc), the spfe feature set (zero-crossing rate, spectral flux, centroid, and rolloff), and the spectral crest factor (scf). All feature vectors were calculated from 512-sample frames with no overlap. First-order statistics were then calculated over windows of 1 sec. with 0.5 sec. overlap; thus, for the last 30 seconds of each clip there were 61 feature vectors. In addition to these features, we also used the features from the MediaEval 2014 baseline system [1].

====3.2 Data clustering====
In a way similar to [5], we clustered all training clips into four clusters based on their static A-V values. Separate SSMs were trained on each cluster's data. During testing, the trajectory obtained from the model showing the best match, i.e. the highest likelihood, was taken as the prediction result; a sketch of this model selection step is given below.
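The paper does not specify the clustering algorithm or the exact likelihood computation, so the following is a hedged sketch of our reading of this step: k-means on the per-clip mean A-V values stands in for the unspecified clustering, the `fit_linear_ssm` helper comes from the sketch in Section 2.1, and model selection uses the standard Kalman innovation log-likelihood. All names are illustrative assumptions.

<pre>
import numpy as np
from sklearn.cluster import KMeans

def train_clustered_ssms(clips, n_clusters=4):
    """Cluster clips by their static (per-clip mean) A-V value and train one
    linear SSM per cluster. `clips` is a list of (X, Y) pairs holding each
    clip's A-V trajectory and feature sequence."""
    static_av = np.array([X.mean(axis=0) for X, _ in clips])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(static_av)
    models = []
    for k in range(n_clusters):
        # Concatenate the cluster's clips; clip boundaries are ignored in the
        # transition regression for simplicity.
        Xk = np.vstack([X for (X, _), l in zip(clips, labels) if l == k])
        Yk = np.vstack([Y for (_, Y), l in zip(clips, labels) if l == k])
        models.append(fit_linear_ssm(Xk, Yk))
    return models

def select_and_filter(Y, models, m0, P0):
    """Filter the test features Y with every cluster model and return the
    trajectory of the model with the highest innovation log-likelihood."""
    best_ll, best_traj = -np.inf, None
    for A, B, Q, R in models:
        m, P, ll, traj = m0, P0, 0.0, []
        for y in Y:
            m, P = A @ m, A @ P @ A.T + Q          # predict
            S = B @ P @ B.T + R
            r = y - B @ m                          # innovation
            ll += -0.5 * (r @ np.linalg.solve(S, r)
                          + np.linalg.slogdet(S)[1]
                          + len(y) * np.log(2 * np.pi))
            K = P @ B.T @ np.linalg.inv(S)
            m, P = m + K @ r, P - K @ B @ P        # update
            traj.append(m)
        if ll > best_ll:
            best_ll, best_traj = ll, np.array(traj)
    return best_traj
</pre>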
===4. Results===
In order to see the effect of data clustering, we also evaluated a linear system trained on all 600 clips. Tables 1 and 2 show the average correlation coefficient as well as the average RMS error with respect to the different features, for Arousal and Valence respectively. As can be seen, the clustered multiple models show lower correlation but smaller RMSE. It is possible that the clustering reduced the amount of training data for each model, resulting in less accurate parameter estimation.

{| border="1" cellpadding="4"
|+ Table 1: Kalman filter (KF) and linear RTS smoother AROUSAL results. 144 clips validation set.
|-
! rowspan="2" | Features !! colspan="2" | KF !! colspan="2" | RTS
|-
! Corr.Coef. !! RMSE !! Corr.Coef. !! RMSE
|-
! colspan="5" | Single model
|-
| mfcc || 0.2062 || 0.2894 || 0.1070 || 0.3008
|-
| spfe || 0.1976 || 0.2860 || 0.0998 || 0.3109
|-
| mfcc+spfe || 0.2326 || 0.2378 || 0.0894 || 0.2291
|-
| mfcc+scf || 0.1171 || 0.2288 || 0.1611 || 0.2188
|-
| baseline || 0.2791 || 0.3631 || 0.1898 || 0.4027
|-
! colspan="5" | Multiple models
|-
| mfcc || 0.1698 || 0.1384 || 0.0991 || 0.1284
|-
| spfe || 0.0957 || 0.1874 || 0.0292 || 0.1772
|-
| mfcc+spfe || 0.2022 || 0.1290 || 0.1246 || 0.1277
|-
| mfcc+scf || 0.0059 || 0.1613 || 0.0253 || 0.1615
|-
| baseline || 0.0212 || 0.2276 || 0.0236 || 0.2259
|}

{| border="1" cellpadding="4"
|+ Table 2: Kalman filter (KF) and linear RTS smoother VALENCE results. 144 clips validation set.
|-
! rowspan="2" | Features !! colspan="2" | KF !! colspan="2" | RTS
|-
! Corr.Coef. !! RMSE !! Corr.Coef. !! RMSE
|-
! colspan="5" | Single model
|-
| mfcc || 0.0411 || 0.6262 || 0.0598 || 0.7082
|-
| spfe || 0.0332 || 0.3945 || 0.0464 || 0.4710
|-
| mfcc+spfe || 0.0304 || 0.6208 || 0.0725 || 0.6978
|-
| mfcc+scf || 0.1545 || 0.6692 || 0.1401 || 0.7231
|-
| baseline || 0.0753 || 0.2681 || 0.0779 || 0.2996
|-
! colspan="5" | Multiple models
|-
| mfcc || -0.082 || 0.1847 || -0.042 || 0.1915
|-
| spfe || -0.055 || 0.2353 || -0.060 || 0.2497
|-
| mfcc+spfe || -0.054 || 0.1866 || -0.068 || 0.1914
|-
| mfcc+scf || 0.0149 || 0.1688 || -0.008 || 0.1703
|-
| baseline || -0.080 || 0.2425 || -0.058 || 0.2497
|}

Table 3 shows the results of the GP-based system evaluated with multiple models. A single model was not used due to prohibitive memory requirements. Compared to the corresponding multiple-model results of the linear system, only Valence shows some improvement.

{| border="1" cellpadding="4"
|+ Table 3: GP filter (GP-F) and GP-RTS smoother results. Multiple models. 144 clips validation set.
|-
! rowspan="2" | Features !! colspan="2" | GP-F !! colspan="2" | GP-RTS
|-
! Corr.Coef. !! RMSE !! Corr.Coef. !! RMSE
|-
! colspan="5" | AROUSAL
|-
| mfcc || 0.0436 || 0.3088 || 0.0743 || 0.3207
|-
| spfe || 0.0582 || 0.3048 || 0.0714 || 0.3486
|-
| baseline || -0.0073 || 0.3025 || 0.0393 || 0.3444
|-
! colspan="5" | VALENCE
|-
| mfcc || 0.0217 || 0.2766 || 0.0313 || 0.3083
|-
| spfe || 0.0283 || 0.3297 || -0.003 || 0.3515
|-
| baseline || -0.011 || 0.3891 || -0.020 || 0.4431
|}

Using the official test set of 1000 clips, we were able to evaluate only the Kalman filter based system due to time limitations. Results using the baseline features as well as a couple of the Marsyas feature sets are presented in Table 4.

{| border="1" cellpadding="4"
|+ Table 4: Kalman filter results using the 1000 clips test set.
|-
! Features !! Corr.Coef. !! RMSE
|-
! colspan="3" | AROUSAL
|-
| mfcc+spfe || 0.2735±0.4522 || 0.3733±0.1027
|-
| mfcc+scf || 0.1622±0.5754 || 0.3541±0.0990
|-
| baseline || 0.2063±0.5720 || 0.0804±0.0505
|-
! colspan="3" | VALENCE
|-
| mfcc+spfe || 0.0469±0.4326 || 0.2002±0.0971
|-
| mfcc+scf || 0.0265±0.4378 || 0.1338±0.0806
|-
| baseline || 0.1665±0.5166 || 0.1385±0.0723
|}

===5. Conclusions===
We presented two dynamic music emotion recognition systems based on state-space models: one linear and one based on Gaussian Processes. The preliminary results did not show a clear advantage for any system or feature set, probably due to the small size of the validation set. More detailed experiments involving more data are planned for the future.

===6. References===
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2014. In MediaEval 2014 Workshop, Barcelona, Spain, Oct 2014.

[2] M. Deisenroth, R. Turner, M. Huber, U. Hanebeck, and C. Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.

[3] K. Markov, M. Iwata, and T. Matsui. Music emotion recognition using Gaussian processes. In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia (CrowdMM). ACM, 2013.

[4] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, 2006.

[5] E. Schmidt and Y. Kim. Prediction of time-varying musical mood distributions using Kalman filtering. In 2010 Ninth International Conference on Machine Learning and Applications (ICMLA), pages 655–660, Dec 2010.