<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Music Emotion Recognition Using State-Space Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantin Markov</string-name>
          <email>markov@u-aizu.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomoko Matsui</string-name>
          <email>tmatsui@ism.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Statistical Modeling, Institute of Statistical Mathematics</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Human Interface Laboratory, The University of Aizu</institution>
          ,
          <addr-line>Fukushima</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2014 benchmark evaluation campaign. The arousal-valence trajectory prediction is cast as a time series filtering task and is modeled by state-space models. These models include a standard linear model (Kalman filter) as well as a novel non-linear, non-parametric Gaussian Process based dynamic system. The music signal was parametrized using standard features extracted with the Marsyas toolkit. Based on the preliminary results obtained from a small random validation set, no clear advantage of any feature or model could be observed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Gaussian Processes (GPs) [
        <xref ref-type="bibr" rid="ref6">4</xref>
        ] are becoming increasingly
popular in the Machine Learning community for their ability
to learn highly non-linear mappings between two continuous
data spaces. Previously, we have successfully applied GPs
for static music emotion recognition [
        <xref ref-type="bibr" rid="ref5">3</xref>
        ]. Dynamic or
continuous emotion estimation is a more difficult task, and there are
several approaches to solving it. The simplest is to assume
that emotion is constant over a relatively short period of time
and apply static emotion recognition methods. A better
approach is to consider the emotion trajectory as a time-varying
process and try to track it, or to use time series modeling
techniques. In [
        <xref ref-type="bibr" rid="ref7">5</xref>
        ], the authors use Kalman filters to model emotion
evolution in time for each of four data partitions. For
evaluation, the KL divergence between the predicted and reference
A-V point distributions is measured assuming "perfect" test
sample partitioning. Our approach is similar since we also
use data partitioning; however, we apply a model selection
method. In addition, we present a novel dynamic music
emotion model based on GPs. The task and the database used
in this evaluation are described in detail in the Emotion in
Music overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. STATE-SPACE MODELS</title>
      <p>State-space models (SSMs) are widely used in time series
analysis, prediction, and modeling. They consist of a latent
state variable x_t ∈ R^e and an observable measurement variable
y_t ∈ R^d, which are related as follows:

  x_t = f(x_{t-1}) + v_{t-1}   (1)
  y_t = g(x_t) + w_t           (2)</p>
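      <p>As a concrete illustration, Eqs. (1)-(2) can be simulated directly. The following Python sketch uses hypothetical dimensions, placeholder noise levels, and a simple linear choice of f and g (none of these values are the ones used in our systems):

```python
import numpy as np

rng = np.random.default_rng(0)

e, d, T = 2, 4, 100                  # state dim, observation dim, length
A = 0.9 * np.eye(e)                  # placeholder linear dynamics f(x) = Ax
B = rng.standard_normal((d, e))      # placeholder linear observation g(x) = Bx
q, r = 0.1, 0.05                     # process / measurement noise std devs

x = np.zeros((T, e))
y = np.zeros((T, d))
x[0] = rng.standard_normal(e)
y[0] = B @ x[0] + r * rng.standard_normal(d)
for t in range(1, T):
    x[t] = A @ x[t - 1] + q * rng.standard_normal(e)   # Eq. (1)
    y[t] = B @ x[t] + r * rng.standard_normal(d)       # Eq. (2)
```

The filtering task is then to recover x_t from the observed y_1, ..., y_t.
      </p>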
    </sec>
    <sec id="sec-3">
      <title>2.1 Kalman filter</title>
      <p>The Kalman filter is a linear SSM where f(x) = Ax
and g(x) = Bx, with A and B being unknown
parameters, and v and w are zero-mean Gaussian noises. Thus,
both p(x_t | x_{t-1}) and p(y_t | x_t) become Gaussians, and a simple
analytic solution for the filtering and smoothing tasks can
be obtained.</p>
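      <p>The analytic filtering recursion for this linear-Gaussian case can be sketched as follows; this is a minimal textbook implementation in which A, B and the noise covariances Q, R are assumed given, whereas in our system they are estimated from training data:

```python
import numpy as np

def kalman_filter(y, A, B, Q, R, mu0, P0):
    """Filtering for x_t = A x_{t-1} + v_t, y_t = B x_t + w_t,
    with v ~ N(0, Q) and w ~ N(0, R)."""
    mu, P = mu0, P0
    means = []
    for yt in y:
        # Predict: propagate the state estimate through the linear dynamics.
        mu = A @ mu
        P = A @ P @ A.T + Q
        # Update: correct the prediction with the new observation.
        S = B @ P @ B.T + R                # innovation covariance
        K = P @ B.T @ np.linalg.inv(S)     # Kalman gain
        mu = mu + K @ (yt - B @ mu)
        P = P - K @ B @ P
        means.append(mu)
    return np.array(means)
```

A smoothed (RTS) estimate adds an analogous backward pass over the stored filtered moments.
      </p>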
    </sec>
    <sec id="sec-4">
      <title>2.2 Gaussian Process dynamic system</title>
      <p>
        When f(·) and g(·) are modeled by GPs, we get a Gaussian
Process dynamic system. Such SSMs have been proposed
recently, but they lack efficient and commonly adopted algorithms
for learning and inference. The availability of A-V values for
training, however, makes the learning task easy, since each
target dimension of f(·) and g(·) can be learned independently
using the GP regression training algorithm. For inference,
however, there is no straightforward solution. One can
always opt for Monte Carlo sampling algorithms, but they are
notoriously slow. We used the solution proposed in [
        <xref ref-type="bibr" rid="ref4">2</xref>
        ]. It
is based on analytic moment matching to derive Gaussian
approximations to the filtering and smoothing distributions.
      </p>
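      <p>For the learning step, each output dimension reduces to standard GP regression. A minimal sketch with an RBF kernel (the hyperparameter values here are illustrative, not the trained ones):

```python
import numpy as np

def gp_regress(X, y, Xs, ell=0.2, sf=1.0, noise=0.05):
    """GP regression posterior mean for one output dimension (RBF kernel)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return sf**2 * np.exp(-0.5 * d2 / ell**2)
    K = k(X, X) + noise**2 * np.eye(len(X))   # noisy kernel matrix
    return k(Xs, X) @ np.linalg.solve(K, y)   # posterior mean at test inputs

# Each target dimension of the transition f: x_{t-1} -> x_t (and likewise of
# the observation map g) would be fit separately, e.g.
#   gp_regress(X_prev, X_next[:, dim], X_query)
```
      </p>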
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS</title>
      <p>The development dataset was randomly split into training
and validation sets of 600 and 144 clips, respectively. A full
cross-validation scenario was not adopted due to time constraints.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Feature extraction</title>
      <p>
        Features were extracted from the audio signal, which was
first downsampled to 22050 Hz. Using the Marsyas toolkit,
we obtained features such as mfcc; spfe, including zero-crossing
rate, spectral flux, centroid, and rolloff; and the spectral crest
factor scf. All feature vectors were calculated from 512-sample
frames with no overlap. First-order statistics were
calculated over windows of 1 sec. with 0.5 sec. overlap. Thus,
for the last 30 seconds of each clip there were 61 feature
vectors. In addition to these features, we also used the features
from the MediaEval 2014 baseline system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
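      <p>The aggregation of frame-level features into 1 sec. window statistics can be sketched as follows; only the mean is computed here, and the exact set of first-order statistics is an assumption:

```python
import numpy as np

def window_stats(frames, sr=22050, frame_len=512):
    """Average frame-level features over 1 sec. windows with a 0.5 sec. hop.

    frames: array of shape (n_frames, n_dims), one row per 512-sample frame.
    """
    per_win = sr // frame_len        # frames per 1 sec. window (~43)
    hop = per_win // 2               # 0.5 sec. hop
    out = [frames[s:s + per_win].mean(axis=0)
           for s in range(0, len(frames) - per_win + 1, hop)]
    return np.array(out)
```
      </p>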
    </sec>
    <sec id="sec-7">
      <title>3.2 Data clustering</title>
      <p>
        In a way similar to [
        <xref ref-type="bibr" rid="ref7">5</xref>
        ], we clustered all training clips into
four clusters based on their static A-V values. Separate
SSMs were trained on each cluster's data. During
testing, the trajectory obtained from the model that showed
the best match, i.e. the highest likelihood, was taken as the
prediction result.
      </p>
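      <p>The clustering and model selection steps can be sketched as follows; a plain k-means on the static A-V values is assumed here purely for illustration (the specific clustering algorithm is not stated above):

```python
import numpy as np

def kmeans(av, k=4, iters=50, seed=0):
    """Cluster clips by their static (arousal, valence) values into k clusters."""
    rng = np.random.default_rng(seed)
    centers = av[rng.choice(len(av), size=k, replace=False)]
    for _ in range(iters):
        # Assign each clip to its nearest centroid.
        labels = ((av[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Recompute centroids, keeping the old one if a cluster goes empty.
        centers = np.array([av[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def pick_model(logliks):
    """Test-time model selection: the cluster model with the highest likelihood."""
    return int(np.argmax(logliks))
```
      </p>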
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
      <p>In order to see the effect of data clustering, we also
evaluated a linear system trained on all 600 clips. Tables 1 and 2
show the average correlation coefficient as well as the
average RMS error with respect to different features for Arousal
and Valence, respectively. As can be seen, the clustered
multiple models show lower correlation but smaller RMSE. It
is possible that the clustering reduced the amount of
training data for each model, resulting in less accurate parameter
estimation. Table 3 shows the results of the GP based system
evaluation with multiple models. A single model was not used
due to prohibitive memory requirements. Compared to the
corresponding multiple-model results of the linear system,
only Valence shows some improvement.</p>
      <p>Using the official test set consisting of 1000 clips, we were
able to evaluate only the Kalman filter based system due to
time limitations. Results using the baseline features as well
as a couple of Marsyas feature sets are presented in Table 4.</p>
    </sec>
    <sec id="sec-9">
      <title>5. CONCLUSIONS</title>
      <p>We presented two state-space model based dynamic music
emotion recognition systems - one linear and one based on
Gaussian Processes. The preliminary results did not show a
clear advantage of any system or feature set. This is
probably due to the small size of the validation set. More detailed
experiments involving more data are planned for the future.</p>
      <p>[Table 4, Kalman filter results on the official test set; values recovered from a flattened layout, column alignment uncertain. Rows: mfcc, spfe, baseline. Corr.Coef. AROUSAL: 0.2735 0.4522, 0.1622 0.5754, 0.2063 0.5720. VALENCE: 0.0469 0.4326, 0.0265 0.4378, 0.1665 0.5166, 0.0743, 0.0714, 0.0393.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in music task at MediaEval 2014</article-title>
          . In MediaEval 2014 Workshop, Barcelona, Spain, Oct
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <!-- Tables 2 and 3 were misparsed into the reference list; recovered content:
           Table 2: Kalman filter and linear RTS smoother VALENCE results, 144-clip
           validation set. Columns: KF Corr.Coef./RMSE, RTS Corr.Coef./RMSE.
           Single model: mfcc 0.0411/0.6262, 0.0598/0.7082; spfe 0.0332/0.3945,
           0.0464/0.4710; mfcc+spfe 0.0304/0.6208, 0.0725/0.6978; mfcc+scf
           0.1545/0.6692, 0.1401/0.7231; baseline 0.0753/0.2681, 0.0779/0.2996.
           Multiple models: mfcc -0.082/0.1847, -0.042/0.1915; spfe -0.055/0.2353,
           -0.060/0.2497; mfcc+spfe -0.054/0.1866, -0.068/0.1914; mfcc+scf
           0.0149/0.1688, -0.008/0.1703; baseline -0.080/0.2425, -0.058/0.2497.
           Table 3: GP filter and GP-RTS smoother results, multiple models, 144-clip
           validation set. Columns: GP-F Corr.Coef., GP-RTS Corr.Coef., RMSE. -->
      <ref id="ref4">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Deisenroth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Hanebeck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          .
          <article-title>Robust filtering and smoothing with gaussian processes</article-title>
          .
          <source>Automatic Control</source>
          , IEEE Transactions on,
          <volume>57</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1865</fpage>
          -
          <lpage>1871</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsui</surname>
          </string-name>
          .
          <article-title>Music emotion recognition using gaussian processes</article-title>
          .
          <source>In Proceedings of the ACM multimedia 2013 workshop on Crowdsourcing for Multimedia</source>
          ,
          <source>CrowdMM. ACM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Gaussian Processes for Machine Learning</article-title>
          .
          <source>Adaptive Computation and Machine Learning</source>
          . The MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Prediction of time-varying musical mood distributions using kalman filtering</article-title>
          .
          <source>In Machine Learning and Applications (ICMLA)</source>
          ,
          <year>2010</year>
          Ninth International Conference on, pages
          <fpage>655</fpage>
          -
          <lpage>660</lpage>
          ,
          <year>Dec 2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>