Dynamic Music Emotion Recognition Using Kernel Bayes’ Filter

Konstantin Markov
Human Interface Laboratory
The University of Aizu
Fukushima, Japan
markov@u-aizu.ac.jp

Tomoko Matsui
Department of Statistical Modeling
Institute of Statistical Mathematics
Tokyo, Japan
tmatsui@ism.ac.jp

ABSTRACT
This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2015 benchmark evaluation campaign. The arousal-valence trajectory prediction is cast as a time series filtering task and is performed using state-space models. A simple and widely used example is the Kalman Filter; however, it is a linear parametric model and has serious limitations. On the other hand, non-linear and non-parametric approaches do not have such drawbacks, but they often scale poorly with the number of training data and their dimension. One such recently proposed method is the Kernel Bayes’ Filter (KBF). It uses only data Gram matrices and thus works (almost) equally well with data of both low and high dimension. In our experiments, we used the feature set provided by the organizers without any change. All the development data were clustered into six clusters based on the genre information available from the meta-data. For performance comparison, we built three more emotion recognition systems based on standard Multivariate Linear Regression (MLR), Support Vector Machine regression (SVR) and the Kalman Filter (KF). The results obtained from a 4-fold cross-validation on the development set show that all types of models, except the KF, achieved very similar performance, which suggests that they may have reached the upper bound of the feature set's discrimination power.

1.  INTRODUCTION
Dynamic or continuous emotion estimation is a more difficult task, and there are several approaches to solving it. The simplest one is to assume that emotion is constant over a relatively short period of time and to apply static emotion recognition methods. These include conventional regression methods as well as a combination of classification and regression, where data are clustered in advance and a separate regression model is built for each cluster. Testing then involves an initial classification step or a model selection procedure. A better approach is to consider the emotion trajectory as a time varying process and try to track it, or to use time series modelling techniques involving state-space models (SSM). A popular and simple SSM is the Kalman filter (KF). It is a linear system and is quite fast, since it requires just matrix multiplications and its complexity is linear in the number of data. However, the linearity assumption is a big drawback, and the KF performs poorly when the data relationship is non-linear.

Non-parametric non-linear kernel models [4] are becoming more and more popular in the Machine Learning community for their ability to learn highly non-linear mappings between two continuous data spaces. They extend the conventional kernel data mapping into high dimensional spaces to the embedding of data distributions in such spaces. This allows for Bayesian reasoning and the development of inference algorithms which, however, involve only Gram matrix manipulations. Although the complexity is O(n^3) because of the matrix inversions, it does not depend on the data dimensionality, which is a big advantage compared to other non-linear methods based on Monte Carlo sampling [3].

The task and the database used in this evaluation are described in detail in the task overview paper [1].
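To make the Kalman filter baseline mentioned above concrete, a minimal predict/update recursion is sketched below in Python/NumPy. This is a generic linear-Gaussian filter given only for illustration; the state and observation matrices are placeholders and not the configuration actually used for the KF system in our experiments.

    import numpy as np

    def kalman_step(m, P, z, A, C, Q, R):
        # One Kalman filter recursion: predict with the linear dynamics (A, Q),
        # then correct with the new observation z through the linear model (C, R).
        # Only matrix products and one small inverse are required, which is why
        # the cost grows linearly with the number of processed frames.
        m_pred = A @ m                       # predicted state mean
        P_pred = A @ P @ A.T + Q             # predicted state covariance
        S = C @ P_pred @ C.T + R             # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
        m_new = m_pred + K @ (z - C @ m_pred)
        P_new = (np.eye(len(m)) - K @ C) @ P_pred
        return m_new, P_new

    # Toy usage with a 2-dimensional arousal-valence state observed directly
    m, P = np.zeros(2), np.eye(2)
    A, C = np.eye(2), np.eye(2)
    Q, R = 0.01 * np.eye(2), 0.1 * np.eye(2)
    m, P = kalman_step(m, P, np.array([0.2, -0.1]), A, C, Q, R)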
2.  KERNEL BAYES’ FILTER
Details about the Kernel Bayes’ Filter can be found in [2]. Here we provide just the basic notation and the final update rules. During KBF training, ground truth values of both the observations X = {x_1, ..., x_T} and the corresponding state values Y = {y_1, ..., y_T} are required. The prediction and conditioning steps of the standard filtering algorithms can be reformulated with respect to the kernel embeddings. The embedding of the predictive distribution p(x_t | y_{1:t}) is denoted as µ_{x_t | y_{1:t}} and is estimated as Σ_{i=1}^{T} α_i φ(x_i), where φ() is the feature map and α_t is updated recursively using

    D_{t+1} = diag((G + λI)^{-1} G̃ α_t),
    α_{t+1} = D_{t+1} K ((D_{t+1} K)^2 + βI)^{-1} D_{t+1} K_{:x_{t+1}}.        (1)

Here, G and K are the Gram matrices of the training states and observations, respectively, G̃ is a Gram matrix with entries G̃_{ij} = k(x_i, x_{j+1}), and K_{:x_{t+1}} = (k(x_1, x_{t+1}), ..., k(x_T, x_{t+1})). The regularization parameters λ and β are needed to avoid numerical problems during the matrix inversions.

There are a few kernel functions that can be used with the KBF, such as the linear, RBF and polynomial kernels. Their parameters, as well as the regularization constants λ and β, comprise the set of hyper-parameters of a KBF system. Unfortunately, there is no algorithm for learning these hyper-parameters from data. They have to be set manually and, as our experiments showed, they are critical for obtaining good performance.
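To illustrate the recursion in Eq. (1), the sketch below implements one update step with NumPy under the assumption of an RBF kernel for the observations. The kernel choice, the hyper-parameter values and the helper names are assumptions made for this example only, not the exact configuration of our KBF system.

    import numpy as np

    def rbf_gram(A, B, gamma=1.0):
        # RBF kernel matrix between the rows of A (n x d) and B (m x d).
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def kbf_step(alpha, x_new, X, G, K, G_shift, lam=1e-3, beta=1e-3, gamma=1.0):
        # One recursion of Eq. (1): maps alpha_t to alpha_{t+1}.
        #   G       : Gram matrix of the training states        (T x T)
        #   K       : Gram matrix of the training observations  (T x T)
        #   G_shift : matrix with entries k(x_i, x_{j+1})       (T x T)
        T = K.shape[0]
        # D_{t+1} = diag((G + lam*I)^{-1} G_shift alpha_t)
        d = np.linalg.solve(G + lam * np.eye(T), G_shift @ alpha)
        D = np.diag(d)
        # K_{:x_{t+1}}: kernel values between training observations and the new frame
        k_x = rbf_gram(X, x_new[None, :], gamma).ravel()
        DK = D @ K
        # alpha_{t+1} = D K ((D K)^2 + beta*I)^{-1} D K_{:x_{t+1}}
        return DK @ np.linalg.solve(DK @ DK + beta * np.eye(T), D @ k_x)

Given the updated weights, a point estimate of the arousal-valence state can be obtained as a weighted combination of the training state values (ŷ_{t+1} ≈ Σ_i α_i y_i), which is the usual readout for kernel embedding filters; it is shown here only as one possible decoding choice, not necessarily the one used in our system.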
3.  EXPERIMENTS
Using the genre information available from the meta-data, we divided all development clips into six clusters roughly corresponding to the following genres: Classical, Electronic, Jazz-Blues, Rock-Pop, International-Folk, HipHop-SoulRB. The number of clusters was chosen such that the data distribution becomes as uniform as possible.

In order to visualize the relationship between the clustered clips and their emotional content, we calculated arousal and valence statistics per clip; Figure 1 shows the distribution of the mean A-V vectors in the affect space. Different colors represent different genres/clusters, and the circle size is proportional to the A-V standard deviation. As can be seen, there is no clear grouping by genre, though some genres show more compact clouds than others. Both filtering systems, i.e. the KF and the KBF, were built using this clustering scheme, where one model was trained for each genre and tested with the test data from the same genre only. The linear regression and SVR based systems were trained with no regard to the genre clusters.

[Figure 1: Mean A-V values distribution for the development data. Different colors represent different genres. Circle sizes are proportional to the A-V standard deviation.]

Since there is no separate validation data set available, we used a 4-fold cross-validation approach to tune the systems' parameters. The SVR and KBF models have hyper-parameters, such as the kernel function and the regularization constants, which cannot be learned from data. An unconstrained simplex search method was adopted to find the optimum parameter setting; however, it does not guarantee a global optimum and, in the case of the KBF, it turned out that the initial point has a big impact on the final result.
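As an illustration of how such an unconstrained simplex (Nelder-Mead) search over the KBF hyper-parameters could be organized, the sketch below uses SciPy, searches in log-space to keep the problem unconstrained, and restarts from several initial points because of the sensitivity to initialization noted above. The objective here is a smooth stand-in; in an actual run it would be replaced by the 4-fold cross-validation score of the system.

    import numpy as np
    from scipy.optimize import minimize

    def cv_rmse(gamma, lam, beta):
        # Stand-in for the real 4-fold cross-validation RMSE of a KBF built with
        # these hyper-parameters; a dummy smooth surface keeps the snippet runnable.
        return (np.log(gamma) - 0.5)**2 + (np.log(lam) + 3)**2 + (np.log(beta) + 3)**2

    def objective(log_params):
        gamma, lam, beta = np.exp(log_params)   # optimize in log-space
        return cv_rmse(gamma, lam, beta)

    best = None
    rng = np.random.default_rng(0)
    for x0 in rng.uniform(-6.0, 1.0, size=(5, 3)):   # several random restarts
        res = minimize(objective, x0, method="Nelder-Mead",
                       options={"xatol": 1e-3, "fatol": 1e-3, "maxiter": 300})
        if best is None or res.fun < best.fun:
            best = res
    print("selected (gamma, lambda, beta):", np.exp(best.x))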
4.  RESULTS
Before calculating the correlation (R) and RMSE performance measures, the predicted arousal and valence values, as well as the reference values, were scaled to fit the range [-0.5, +0.5]. This is similar to the way results were obtained during previous evaluations.
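A minimal sketch of these two measures is given below. The min-max scaling used here is only an assumption for the example; the exact scaling procedure is not spelled out above.

    import numpy as np

    def scale_to_half_range(v):
        # Linearly map a trajectory to [-0.5, +0.5] (illustrative min-max scaling).
        v = np.asarray(v, dtype=float)
        lo, hi = v.min(), v.max()
        return (v - lo) / (hi - lo) - 0.5 if hi > lo else np.zeros_like(v)

    def correlation_and_rmse(reference, predicted):
        # Pearson correlation R and root mean squared error between trajectories.
        reference = np.asarray(reference, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        r = np.corrcoef(reference, predicted)[0, 1]
        rmse = np.sqrt(np.mean((reference - predicted) ** 2))
        return r, rmse

    # Example: per-clip scores for one arousal trajectory
    t = np.linspace(0.0, 3.0, 60)
    ref = scale_to_half_range(np.sin(t))
    pred = scale_to_half_range(np.sin(t + 0.3))
    print(correlation_and_rmse(ref, pred))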
Table 1 shows the performance of the KBF for each genre as well as the overall average. For some genres the results are better, which may be due to differences in the data distributions, but also to better hyper-parameter settings. The total averages of all the regression and state-space model based systems are summarised in Table 2.

Table 1: Kernel Bayes’ filter results on the development set.
    Genre              Arousal            Valence
                       R       RMSE       R       RMSE
    Classical          0.282   0.355      0.132   0.390
    Electronic         0.306   0.347      0.265   0.355
    Jazz-Blues         0.367   0.357      0.192   0.372
    Rock-Pop           0.350   0.342      0.167   0.382
    International      0.219   0.365      0.207   0.371
    Hip-Hop, SoulRB    0.307   0.342      0.204   0.348
    Average            0.305   0.351      0.194   0.369

Table 2: Averaged results of all systems on the development set.
    System             Arousal            Valence
                       R       RMSE       R       RMSE
    Regression
      Linear           0.269   0.341      0.184   0.357
      SVM              0.283   0.340      0.214   0.351
    Filters
      Kalman           0.113   0.390      0.068   0.393
      Kernel Bayes’    0.305   0.351      0.194   0.369

The results on the official test data set are shown in Table 3. Due to time limitations, the KBF system uses a reduced (to one fourth) training set, which apparently has a negative effect on its performance. Since the reference and predicted A-V values are scaled here to [-1.0, 1.0], a direct comparison of the RMSE scores with those from the previous tables is possible when they are divided by 2.

Table 3: Averaged results on the test set.
    System    Arousal            Valence
              R       RMSE       R        RMSE
    SVR       0.490   0.446     -0.019    0.542
    KBR       0.419   0.498     -0.035    0.620

5.  CONCLUSIONS
We described several systems developed at the University of Aizu for the MediaEval 2015 Emotion in Music evaluation campaign. Our focus is on the machine learning part of this very challenging task and, thus, we built and evaluated a few systems based on conventional regression techniques as well as on a new non-parametric non-linear approach using the Kernel Bayes’ Filter state-space model. All systems used the feature set provided by the challenge organizers. Although the modelling techniques we utilized range from simple linear regression to a sophisticated state-space Bayesian filter, there was a negligible difference in performance. This suggests that the feature set may not have enough discriminating power to enable non-parametric non-linear models to show their advantages.

6.  ACKNOWLEDGEMENT
The authors would like to thank Dr. Y. Nishiyama from the University of Electro-Communications, Tokyo, for sharing his Matlab kernel mean toolbox (kmtb).
7.   REFERENCES
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[2] K. Fukumizu, L. Song, and A. Gretton. Kernel bayes’ rule:
    Bayesian inference with positive definite kernels. J. Mach.
    Learn. Res., 14(1):3753–3783, Dec. 2013.
[3] K. Markov, T. Matsui, F. Septier, and G. Peters. Dynamic speech emotion recognition with state-space models. In Proc. EUSIPCO 2015, pages 2122–2126, 2015.
[4] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of
    conditional distributions: A unified kernel framework for
    nonparametric inference in graphical models. Signal
    Processing Magazine, IEEE, 30(4):98–111, July 2013.