Dynamic Music Emotion Recognition Using Kernel Bayes' Filter

Konstantin Markov
Human Interface Laboratory, The University of Aizu, Fukushima, Japan
markov@u-aizu.ac.jp

Tomoko Matsui
Department of Statistical Modeling, Institute of Statistical Mathematics, Tokyo, Japan
tmatsui@ism.ac.jp

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany

ABSTRACT

This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2015 benchmark evaluation campaign. The arousal-valence trajectory prediction is cast as a time series filtering task and is performed using state-space models. A simple and widely used example is the Kalman Filter; however, it is a linear parametric model and has serious limitations. Non-linear and non-parametric approaches, on the other hand, do not have such drawbacks, but often scale poorly with the number of training data and their dimension. One such method proposed recently is the Kernel Bayes' Filter (KBF). It uses only data Gram matrices and thus works (almost) equally well with data of both low and high dimension. In our experiments, we used the feature set provided by the organizers without any change. All the development data were grouped into six clusters based on the genre information available from the metadata. For performance comparison, we built three more emotion recognition systems based on standard Multivariate Linear Regression (MLR), Support Vector Machine Regression (SVR) and the Kalman Filter (KF). The results obtained from a 4-fold cross-validation on the development set show that all types of models, except KF, achieved very similar performance, which suggests that they may have reached the upper bound of the discrimination power of the feature set.

1. INTRODUCTION

Dynamic or continuous emotion estimation is a more difficult task than its static counterpart, and there are several approaches to solving it. The simplest one is to assume that emotion is constant over a relatively short period of time and to apply static emotion recognition methods. These include conventional regression methods as well as a combination of classification and regression, where data are clustered in advance and a separate regression model is built for each cluster. Testing then involves an initial classification step or a model selection procedure. A better approach is to consider the emotion trajectory as a time-varying process and try to track it, or to use time series modelling techniques involving state-space models (SSM). A popular and simple SSM is the Kalman filter (KF). It is a linear system and is quite fast, since it requires just matrix multiplications and its complexity is linear in the number of data. However, the linearity assumption is a big drawback, and KF performs poorly when the data relationship is non-linear.

Non-parametric non-linear kernel models [4] are becoming more and more popular in the Machine Learning community for their ability to learn highly non-linear mappings between two continuous data spaces. They extend the conventional kernel mapping of data into high-dimensional spaces to the embedding of data distributions in such spaces. This allows for Bayesian reasoning and for developing inference algorithms which involve only manipulations of Gram matrices. Although the complexity is $O(n^3)$ because of matrix inversions, it does not depend on the data dimensionality, which is a big advantage compared to other non-linear methods based on Monte Carlo sampling [3].

The task and the database used in this evaluation are described in detail in the task overview paper [1].

2. KERNEL BAYES' FILTER

Details about the Kernel Bayes' Filter can be found in [2]. Here we provide just the basic notation and the final update rules. KBF training requires the true values of both the observations $X = \{x_1, \dots, x_T\}$ and the corresponding state values $Y = \{y_1, \dots, y_T\}$. The prediction and conditioning steps of the standard filtering algorithms can be reformulated with respect to the kernel embeddings. The embedding of the predictive distribution $p(x_t | y_{1:t})$ is denoted as $\mu_{x_t | y_{1:t}}$ and is estimated as $\sum_{i=1}^{T} \alpha_i \phi(x_i)$, where $\phi()$ is the feature map and $\alpha_t$ is updated recursively using

$$D_{t+1} = \mathrm{diag}\big((G + \lambda I)^{-1} \tilde{G} \alpha_t\big),$$
$$\alpha_{t+1} = D_{t+1} K \big((D_{t+1} K)^2 + \beta I\big)^{-1} D_{t+1} K_{:x_{t+1}} \qquad (1)$$

Here, $G$ and $K$ are the Gram matrices of the training states and observations, $\tilde{G}$ is a Gram matrix with entries $\tilde{G}_{ij} = k(x_i, x_{j+1})$, and $K_{:x_{t+1}} = (k(x_1, x_{t+1}), \dots, k(x_T, x_{t+1}))$. The regularization parameters $\lambda$ and $\beta$ are needed to avoid numerical problems during matrix inversion.

Several kernel functions can be used with the KBF, such as linear, RBF, and polynomial. Their parameters, as well as the regularization constants $\lambda$ and $\beta$, comprise the set of hyper-parameters of a KBF system. Unfortunately, there is no algorithm for learning those hyper-parameters from data. They have to be set manually and, as our experiments showed, are critical for obtaining good performance.

3. EXPERIMENTS

Using the genre information available from the metadata, we divided all development clips into six clusters roughly corresponding to the following genres: Classical, Electronic, Jazz-Blues, Rock-Pop, International-Folk, HipHop-SoulRB. The number of clusters was chosen such that the data distribution becomes as uniform as possible.

In order to visualize the relationship between the clustered clips and their emotional content, we calculated arousal and valence statistics per clip. Figure 1 shows the distribution of the mean AV vectors in the affect space. Different colors represent different genres/clusters, and the circle size is proportional to the AV standard deviation. As can be seen, there is no clear grouping by genre, though some genres show more compact clouds than others.

Figure 1: Mean A-V values distribution for the development data. Different colors represent different genres. Circle sizes are proportional to the A-V standard deviation.

Both filtering systems, i.e. KF and KBF, were built using this clustering scheme, where one model was trained for each genre and tested with the test data from the same genre only. The linear regression and SVR based systems were trained with no regard to genre clusters.

Since there is no validation data set available, we used a 4-fold cross-validation approach to tune the systems' parameters. The SVR and KBF models have hyper-parameters, such as the kernel function and regularization constants, which cannot be learned from data. An unconstrained simplex search method was adopted to find the optimum parameter setting; however, it does not guarantee a global maximum, and in the case of KBF it turned out that the initial point has a big impact on the final result.

4. RESULTS

Before the calculation of the correlation and RMSE performance measures, the predicted arousal and valence values as well as the reference values were scaled to fit the range [-0.5, +0.5]. This is similar to the way results were obtained during previous evaluations.

Table 1 shows the performance of the KBF for each genre as well as the total average. For some genres the results are better, which may be due to differences in the data distributions, but also to better hyper-parameter settings. Total averages for all the regression and state-space model based systems are summarised in Table 2.

Table 1: Kernel Bayes' filter results on the development set.

                      Arousal           Valence
Genre                 R       RMSE      R       RMSE
Classical             0.282   0.355     0.132   0.390
Electronic            0.306   0.347     0.265   0.355
Jazz-Blues            0.367   0.357     0.192   0.372
Rock-Pop              0.350   0.342     0.167   0.382
International         0.219   0.365     0.207   0.371
Hip-Hop, SoulRB       0.307   0.342     0.204   0.348
Average               0.305   0.351     0.194   0.369

Table 2: Averaged results of all systems on the development set.

                      Arousal           Valence
System                R       RMSE      R       RMSE
Regression
  Linear              0.269   0.341     0.184   0.357
  SVM                 0.283   0.340     0.214   0.351
Filters
  Kalman              0.113   0.390     0.068   0.393
  Kernel Bayes'       0.305   0.351     0.194   0.369

The results using the official test data set are shown in Table 3. Due to time limitations, the KBF system uses a training set reduced to one fourth of the data, which apparently has a negative effect on the performance. Since the reference and predicted AV values are here scaled to [-1.0, 1.0], a direct comparison of the RMSE scores with those from the previous tables is possible when they are divided by 2.

Table 3: Averaged results on the test set.

                      Arousal           Valence
System                R       RMSE      R       RMSE
SVR                   0.490   0.446     -0.019  0.542
KBF                   0.419   0.498     -0.035  0.620

5. CONCLUSIONS

We described several systems developed at the University of Aizu for the MediaEval 2015 Emotion in Music evaluation campaign. Our focus is on the machine learning part of this very challenging task and, thus, we built and evaluated several systems based on conventional regression techniques as well as on a new non-parametric non-linear approach using the Kernel Bayes' Filter state-space system. All of them used the feature set provided by the challenge organizers. Although the modelling techniques we utilized range from simple linear regression to a sophisticated state-space Bayesian filter, there was a negligible difference in the performance. This suggests that the feature set may not have enough discriminating power to enable non-parametric non-linear models to show their advantages.

6. ACKNOWLEDGEMENT

The authors would like to thank Dr. Y. Nishiyama from the University of Electro-Communications, Tokyo, for sharing his Matlab Kernel mean toolbox (kmtb).

7. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[2] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res., 14(1):3753–3783, Dec. 2013.
[3] K. Markov, T. Matsui, F. Septier, and G. Peters. Dynamic speech emotion recognition with state-space models. In Proc. EUSIPCO'2015, pages 2122–2126, 2015.
[4] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. Signal Processing Magazine, IEEE, 30(4):98–111, July 2013.
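
As an illustration of the recursive weight update in Eq. (1) of Section 2, the following is a minimal NumPy sketch. It is not the implementation used in the experiments. The function names `rbf_gram` and `kbf_update`, the RBF kernel width, the toy data, and the values of λ and β are all hypothetical choices made for this example. The sketch also assumes one particular reading of the notation: `G` and `G_tilde` are Gram matrices over consecutive training states, `K` is the Gram matrix over training observations, and `k_new` is the kernel vector between the training observations and the newly arrived observation.

```python
import numpy as np


def rbf_gram(a, b, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kbf_update(alpha, G, G_tilde, K, k_new, lam, beta):
    """One step of Eq. (1): map the weight vector alpha_t to alpha_{t+1}."""
    T = G.shape[0]
    # Diagonal of D_{t+1} = diag((G + lam*I)^{-1} G_tilde alpha_t).
    d = np.linalg.solve(G + lam * np.eye(T), G_tilde @ alpha)
    DK = d[:, None] * K   # equals diag(d) @ K
    rhs = d * k_new       # equals diag(d) @ k_new
    # alpha_{t+1} = D K ((D K)^2 + beta*I)^{-1} D k_new
    return DK @ np.linalg.solve(DK @ DK + beta * np.eye(T), rhs)


# Toy data (hypothetical): T + 1 consecutive 2-D states and 5-D observations.
rng = np.random.default_rng(0)
T = 8
states = rng.normal(size=(T + 1, 2))
obs = rng.normal(size=(T + 1, 5))

G = rbf_gram(states[:-1], states[:-1])        # k(s_i, s_j)
G_tilde = rbf_gram(states[:-1], states[1:])   # k(s_i, s_{j+1})
K = rbf_gram(obs[:-1], obs[:-1])              # k(o_i, o_j)
k_new = rbf_gram(obs[:-1], obs[-1:])[:, 0]    # k(o_i, o_new)

alpha = np.full(T, 1.0 / T)                   # uniform initial weights (assumption)
alpha_next = kbf_update(alpha, G, G_tilde, K, k_new, lam=1e-2, beta=1e-2)
```

Both inversions are carried out with `np.linalg.solve` rather than an explicit inverse, and each costs $O(T^3)$ in the number of training samples regardless of the feature dimension, in line with the complexity remark in the Introduction.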