<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Music Emotion Recognition Using Kernel Bayes' Filter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantin Markov</string-name>
          <email>markov@u-aizu.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomoko Matsui</string-name>
          <email>tmatsui@ism.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Statistical Modeling, Institute of Statistical Mathematics</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Human Interface Laboratory, The University of Aizu</institution>
          ,
          <addr-line>Fukushima</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2015 benchmark evaluation campaign. The arousal-valence trajectory prediction is cast as a time-series filtering task and is performed using state-space models. A simple and widely used example is the Kalman Filter; however, it is a linear parametric model and has serious limitations. Non-linear and non-parametric approaches, on the other hand, do not have such drawbacks, but often scale poorly with the amount and dimensionality of the training data. One such recently proposed method is the Kernel Bayes' Filter (KBF). It operates only on data Gram matrices and thus works (almost) equally well with data of both low and high dimension. In our experiments, we used the feature set provided by the organizers without any change. All the development data were grouped into six clusters based on the genre information available from the meta-data. For performance comparison, we built three more emotion recognition systems based on standard Multivariate Linear Regression (MLR), Support Vector Regression (SVR), and the Kalman Filter (KF). The results obtained from a 4-fold cross-validation on the development set show that all types of models, except the KF, achieved very similar performance, which suggests that they may have reached the upper bound of the feature set's discrimination power.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Dynamic or continuous emotion estimation is a more difficult task, and there are several approaches to solve it. The simplest one is to assume that emotion is constant over a relatively short period of time and to apply static emotion recognition methods. These include conventional regression methods as well as a combination of classification and regression, where data are clustered in advance and a separate regression model is built for each cluster. Testing then involves an initial classification step or a model selection procedure. A better approach is to consider the emotion trajectory as a time-varying process and try to track it, or to use time series modelling techniques involving state-space models (SSM). A popular and simple SSM is the Kalman filter (KF). It is a linear system and is quite fast, since it requires just matrix multiplications and its complexity is linear in the number of data points. However, the linearity assumption is a serious drawback of the KF.</p>
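      <p>As a concrete illustration of the linear SSM mentioned above, a minimal Kalman filter step can be sketched in NumPy as follows. This is a generic textbook predict/update formulation under hypothetical model matrices A, C, Q, R, not the exact configuration of the KF system evaluated in this paper:</p>

```python
import numpy as np

def kalman_step(m, P, z, A, C, Q, R):
    """One Kalman filter iteration: predict the state, then correct it
    with the new observation z. All operations are matrix products."""
    # Predict: propagate mean and covariance through the linear dynamics A
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update: Kalman gain and correction via the observation model C
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    m_new = m_pred + K @ (z - C @ m_pred)
    P_new = (np.eye(len(m)) - K @ C) @ P_pred
    return m_new, P_new
```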
    </sec>
    <sec id="sec-2">
      <title>2. KERNEL BAYES’ FILTER</title>
      <p>
        Details about the Kernel Bayes' Filter can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Here we provide just the basic notation and the final update rules. During KBF training, ground truth values of both the observations X = {x_1, ..., x_T} and the corresponding state values Y = {y_1, ..., y_T} are required. The prediction and conditioning steps of the standard filtering algorithms can be reformulated with respect to the kernel embeddings. The embedding of the predictive distribution p(x_t | y_{1:t}) is denoted μ_{x_t|y_{1:t}} and is estimated as Σ_{i=1}^{T} α_i φ(x_i), where φ(·) is the feature map and α_t is updated recursively using

        D_{t+1} = diag((G + εI)^{-1} G̃ α_t),
        α_{t+1} = D_{t+1} K ((D_{t+1} K)^2 + δI)^{-1} D_{t+1} K_{:x_{t+1}}.   (1)
      </p>
      <p>Here, G and K are the Gram matrices of the training states and observations, respectively, G̃ is a shifted Gram matrix with entries G̃_{ij} = k(x_i, x_{j+1}), and K_{:x_{t+1}} = (k(x_1, x_{t+1}), ..., k(x_T, x_{t+1})). The regularization parameters ε and δ are needed to avoid numerical problems during matrix inversion.</p>
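      <p>For illustration, the recursion in Eq. (1) can be sketched in NumPy as follows; the variable names (G_shift, k_new) and the default values of the regularization parameters are illustrative assumptions, not the paper's actual settings:</p>

```python
import numpy as np

def kbf_update(alpha, G, G_shift, K, k_new, eps=1e-3, delta=1e-3):
    """One Kernel Bayes' Filter weight update, following Eq. (1).
    alpha   : (T,)  current embedding weights
    G       : (T,T) Gram matrix of the training states
    G_shift : (T,T) shifted Gram matrix, G_shift[i, j] = k(x_i, x_{j+1})
    K       : (T,T) Gram matrix of the training observations
    k_new   : (T,)  kernel vector (k(x_1, x_{t+1}), ..., k(x_T, x_{t+1}))
    """
    T = G.shape[0]
    # D_{t+1} = diag((G + eps*I)^{-1} G_shift alpha_t)
    D = np.diag(np.linalg.solve(G + eps * np.eye(T), G_shift @ alpha))
    # alpha_{t+1} = D K ((D K)^2 + delta*I)^{-1} D k_new
    DK = D @ K
    return DK @ np.linalg.solve(DK @ DK + delta * np.eye(T), D @ k_new)
```

Note that only Gram matrices enter the recursion, which is why the cost does not depend on the dimensionality of the raw features.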
      <p>There are a few kernel functions that can be used with the KBF, such as the linear, RBF, and polynomial kernels. Their parameters, as well as the regularization constants ε and δ, comprise the set of hyper-parameters of a KBF system. Unfortunately, there is no algorithm for learning those hyper-parameters from data. They have to be set manually and, as our experiments showed, are critical for obtaining good performance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS</title>
      <p>Using the genre information available from the meta-data, we divided all development clips into six clusters roughly corresponding to the following genres: Classical, Electronic, Jazz-Blues, Rock-Pop, International-Folk, and HipHop-SoulRB. The number of clusters was chosen such that the data distribution becomes as uniform as possible.</p>
      <p>In order to visualize the relationship between the clustered clips and their emotional content, we calculated arousal and valence (AV) statistics per clip, and Figure 1 shows the distribution of the mean AV vectors in the affect space. Different colors represent different genres/clusters, and the circle size is proportional to the AV standard deviation. As can be seen, there is no clear grouping by genre, though some genres show more compact clouds than others. Both filtering systems, i.e. the KF and KBF, were built using this clustering scheme, where one model was trained for each genre and tested with the test data from the same genre only. The linear regression and SVR based systems were trained with no regard to the genre clusters.</p>
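      <p>The genre-dependent training and testing scheme can be sketched as follows; the train_filter helper and the model call signature are hypothetical placeholders for a KF or KBF trainer:</p>

```python
# Sketch of genre-dependent filtering: one model per genre cluster,
# selected at test time by the clip's genre tag.
def train_per_genre(clips, train_filter):
    """clips: list of (genre, features, av_trajectory) tuples."""
    models = {}
    for genre in {c[0] for c in clips}:
        subset = [(x, y) for g, x, y in clips if g == genre]
        models[genre] = train_filter(subset)   # e.g. a KF or KBF
    return models

def predict(models, genre, features):
    # Test clips are routed to the model of their own genre only
    return models[genre](features)
```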
      <p>Since there is no validation data set available, we used a 4-fold cross-validation approach to tune the systems' parameters. The SVR and KBF models have hyper-parameters, such as the kernel function parameters and the regularization constants, which cannot be learned from data. An unconstrained simplex search method was adopted to find the optimum parameter setting; however, it does not guarantee a global optimum and, in the case of the KBF, it turned out that the initial point has a big impact on the final result.</p>
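      <p>Such a simplex search can be reproduced with SciPy's Nelder-Mead implementation; in this hedged sketch the objective is a stand-in quadratic over hypothetical log-scale hyper-parameters, not the actual cross-validation loss:</p>

```python
import numpy as np
from scipy.optimize import minimize

def cv_objective(log_params):
    """Stand-in for the 4-fold cross-validation loss as a function of
    log-scale hyper-parameters (e.g. kernel width, eps, delta).
    The quadratic form and its optimum are purely illustrative."""
    target = np.array([-2.0, -3.0, -3.0])  # hypothetical optimum
    return float(np.sum((log_params - target) ** 2))

# Nelder-Mead simplex search; on a multi-modal objective the result
# depends on the starting point x0, as observed with the KBF.
res = minimize(cv_objective, x0=np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6})
best_log_params = res.x
```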
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS</title>
      <p>Before the calculation of the correlation and RMSE performance measures, the predicted arousal and valence values, as well as the reference values, were scaled to fit the range [-0.5, +0.5]. This is similar to the way results were obtained during previous evaluations.</p>
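      <p>A minimal sketch of the rescaling and the two performance measures, assuming a simple per-trajectory min-max normalization (the exact normalization used in the evaluations may differ):</p>

```python
import numpy as np

def scale_to_half_range(v):
    """Linearly rescale a trajectory to fit the range [-0.5, +0.5]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) - 0.5

def rmse(pred, ref):
    """Root mean squared error between two trajectories."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def corr(pred, ref):
    """Pearson correlation between two trajectories."""
    return float(np.corrcoef(pred, ref)[0, 1])
```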
      <p>Table 1 shows the performance of the KBF for each genre as well as the total average. For some genres, the results are better, which may be due to differences in the data distributions, but also because of better hyper-parameter settings. The total averages from all the regression and state-space model based systems are summarised in Table 2.</p>
      <p>The results using the official test data set are shown in</p>
    </sec>
    <sec id="sec-conclusion">
      <title>5. CONCLUSION</title>
      <p>We described several systems developed at the University of Aizu for the MediaEval 2015 Emotion in Music evaluation campaign. Our focus was on the machine learning part of this very challenging task and, thus, we built and evaluated a few systems based on conventional regression techniques, as well as on a new non-parametric, non-linear approach using the Kernel Bayes' Filter state-space system. All of them used the feature set provided by the challenge organizers. Although the modelling techniques we utilized range from simple linear regression to a sophisticated state-space Bayesian filter, there was a negligible difference in performance. This suggests that the feature set may not have enough discriminating power to enable non-parametric, non-linear models to show their advantages.</p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGEMENT</title>
      <p>The authors would like to thank Dr. Y. Nishiyama from the University of Electro-Communications, Tokyo, for sharing his Matlab kernel mean toolbox (kmtb).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aljanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Emotion in Music task at MediaEval 2015</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukumizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          .
          <article-title>Kernel Bayes' rule: Bayesian inference with positive definite kernels</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3753</fpage>
          -
          <lpage>3783</lpage>
          , Dec.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Septier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Peters</surname>
          </string-name>
          .
          <article-title>Dynamic speech emotion recognition with state-space models</article-title>
          .
          <source>In Proc. EUSIPCO'2015</source>
          , pages
          <fpage>2122</fpage>
          -
          <lpage>2126</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukumizu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          .
          <article-title>Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models</article-title>
          .
          <source>Signal Processing Magazine</source>
          , IEEE,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>98</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>