Music emotion recognition using Gaussian Processes

Konstantin Markov, Motofumi Iwata
Human Interface Laboratory, The University of Aizu, Fukushima, Japan
{markov,s1180127}@u-aizu.ac.jp

Tomoko Matsui
Department of Statistical Modeling, Institute of Statistical Mathematics, Tokyo, Japan
tmatsui@ism.ac.jp

ABSTRACT
This paper describes the music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval'2013 benchmark evaluation campaign. A set of standard feature types provided by the Marsyas toolkit was used to parametrize each music clip. Arousal and valence are modeled separately using Gaussian Process regression (GPR). We compared the performance of GPR and Support Vector regression (SVR) and found that GPR gives better results than SVR for the static per song emotion estimation task. For the dynamic emotion estimation task, GPR had some scalability problems and a fair comparison was not possible.

1. INTRODUCTION
Gaussian Processes (GPs) [2] are becoming increasingly popular in the Machine Learning community for their ability to learn highly non-linear mappings between two continuous data spaces, i.e. the feature space and the valence/arousal (V/A) space. Previously, we successfully applied GPs to a music genre classification task [1], which encouraged us to use GPs for music emotion estimation. Many previous studies [4] have focused on Support Vector regression (SVR), since in most cases it gives superior performance. In this study we compare GP regression with SVR and show that in certain cases GPR can significantly outperform SVR. In addition, GPR produces probabilistic predictions, i.e. it outputs a Gaussian distribution whose mean corresponds to the most probable target value and whose variance shows the certainty of the prediction. As in the case of SVR, GPR also uses kernels, but in contrast it allows the kernel parameters to be learned from the training data.
The database used in this evaluation is described in detail in the Emotion in Music task overview paper [3].

2. GAUSSIAN PROCESS REGRESSION
Given input training data vectors X = {xᵢ}, i = 1, ..., n, and their corresponding target values y = {yᵢ}, the general regression model relates them as yᵢ = f(xᵢ) + εᵢ, where εᵢ ∼ N(0, σₙ²) and f() is an unknown nonlinear function. In GP regression, it is assumed that this function is normally distributed, i.e. the vector f = [f(x₁), ..., f(xₙ)] has a Gaussian distribution f ∼ N(m, K), where K is a kernel covariance matrix and the mean m is often set to zero. This assumption allows the predictive distribution of a test target y∗ to be expressed in closed form only in terms of the training data and the input vector x∗: y∗ | x∗, y, X ∼ N(m∗, σ∗²), where m∗ = k∗ᵀ(K + σₙ²I)⁻¹y and σ∗² = k(x∗, x∗) − k∗ᵀ(K + σₙ²I)⁻¹k∗.
Covariance kernel parameters are learned by maximizing the marginal likelihood p(y|X, θ) = ∫ p(y|f) p(f|X, θ) df with respect to θ, which is known as the maximum likelihood type II approximation.
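The short NumPy sketch below is only an illustration of the two formulas above; the submitted system relied on an existing GPR toolkit, and the function names, the toy data, and the fixed hyperparameter values here are purely illustrative. It computes the predictive mean m∗ and variance σ∗² for a batch of test inputs, and evaluates the log marginal likelihood that would be maximized with respect to the kernel parameters θ.

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0, l=1.0):
    # Squared Exponential kernel: sigma^2 * exp(-(x - x')^T (x - x') / (2 l^2))
    d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return sigma**2 * np.exp(-d2 / (2.0 * l**2))

def gp_predict(X, y, X_star, kernel, sigma_n=0.1):
    # Closed-form GPR prediction:
    #   m*   = k*^T (K + sigma_n^2 I)^-1 y
    #   s*^2 = k(x*, x*) - k*^T (K + sigma_n^2 I)^-1 k*
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))  # noisy training covariance
    K_star = kernel(X, X_star)                      # cross-covariances k*
    K_ss = kernel(X_star, X_star)                   # test covariances
    L = np.linalg.cholesky(K)                       # stable way to apply (K + sigma_n^2 I)^-1
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_star)
    mean = K_star.T @ alpha                         # predictive means m*
    var = np.diag(K_ss) - np.sum(v**2, axis=0)      # predictive variances s*^2
    return mean, var

def log_marginal_likelihood(X, y, kernel, sigma_n=0.1):
    # log p(y | X, theta); maximizing this w.r.t. the kernel parameters
    # is the maximum likelihood type II approximation mentioned above.
    n = len(X)
    K = kernel(X, X) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

# Toy usage with synthetic 1-D data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3.0, 3.0, 5)[:, None]
mean, var = gp_predict(X, y, X_star, sq_exp_kernel)
print(mean, np.sqrt(var), log_marginal_likelihood(X, y, sq_exp_kernel))

Note that exact inference of this kind requires a Cholesky factorization of the n x n training covariance, i.e. O(n³) time, which is consistent with the scalability problems reported for the dynamic task below.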
3. SYSTEM DESCRIPTION
Dimensional music emotion recognition can be easily decomposed into two independent classical regression problems: one for the valence and another for the arousal. Thus, our system consists of two regression modules and a common feature extraction module.

3.1 Feature extraction
Features are extracted only from the audio signal, which is first downsampled to 22050 Hz. We tried various standard features tailored for music processing, such as MFCC, Statistical Spectrum Descriptors (SSD), Chroma, Spectral Crest Factor (SCF), and Spectral Flatness Measure (SFM), separately as well as in combinations of several of them. All feature vectors were calculated using the Marsyas toolkit with 512-sample frames and no overlap. For the dynamic emotion estimation task, first order statistics (mean and std) of the feature vectors are calculated over a window of about 1 sec, giving 45 vectors per music clip. For the static emotion estimation, the same statistics of these 45 vectors are calculated, resulting in a single high-dimensional feature vector per song.
After extensive preliminary experimentation, we found that the best performing combination of features for the per song emotion estimation is MFCC, SCF, and SFM. Adding SSD features did not have any noticeable effect, and Chroma features actually hurt the performance. We refer to this combination of features as UoA features. We also experimented with the features released by the benchmark organisers, which we call MediaEval features.

3.2 GPR implementation
Valence and arousal are modeled by separate GPRs. We used the standard Gaussian likelihood function, which allows exact inference to be performed. The GP mean was set to zero and only the type of covariance kernel was varied. We experimented with the following kernels:

• Linear (LIN): k(x, x′) = (xᵀx′ + 1)/l²
• Squared Exponential (SE): k(x, x′) = σ² exp(−(x − x′)ᵀ(x − x′)/(2l²))
• Rational Quadratic (RQ): k(x, x′) = σ²(1 + (x − x′)ᵀ(x − x′)/(2αl²))^(−α)
• Matérn 3 (MAT3): k(x, x′) = σ²(1 + r) exp(−r), r = √(3(x − x′)ᵀ(x − x′)/l²)

where σ and l are parameters learned from the training data. Sums or products of several kernels are also valid covariance functions and often give better performance than single kernels.
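As a small illustration of the kernels listed above, the NumPy sketch below shows how they, and sums or products of them (e.g. the SE+MAT3 and SExRQ combinations reported in the tables), could be written. The helper names and the default values of σ, l and α are illustrative only; in the actual system these hyperparameters are learned from the training data, and the submission did not use this code.

import numpy as np

def _sqdist(X1, X2):
    # Pairwise squared Euclidean distances (x - x')^T (x - x')
    return (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
            - 2.0 * X1 @ X2.T)

def lin(X1, X2, l=1.0):
    # Linear (LIN): (x^T x' + 1) / l^2
    return (X1 @ X2.T + 1.0) / l**2

def se(X1, X2, sigma=1.0, l=1.0):
    # Squared Exponential (SE): sigma^2 exp(-(x - x')^T (x - x') / (2 l^2))
    return sigma**2 * np.exp(-_sqdist(X1, X2) / (2.0 * l**2))

def rq(X1, X2, sigma=1.0, l=1.0, alpha=1.0):
    # Rational Quadratic (RQ): sigma^2 (1 + (x - x')^T (x - x') / (2 alpha l^2))^(-alpha)
    return sigma**2 * (1.0 + _sqdist(X1, X2) / (2.0 * alpha * l**2))**(-alpha)

def mat3(X1, X2, sigma=1.0, l=1.0):
    # Matern 3 (nu = 3/2): sigma^2 (1 + r) exp(-r), r = sqrt(3 (x - x')^T (x - x') / l^2)
    r = np.sqrt(np.maximum(3.0 * _sqdist(X1, X2), 0.0)) / l
    return sigma**2 * (1.0 + r) * np.exp(-r)

# Sums and products of valid kernels are again valid covariance functions.
def se_plus_mat3(X1, X2):
    return se(X1, X2) + mat3(X1, X2)

def se_times_rq(X1, X2):
    return se(X1, X2) * rq(X1, X2)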
4. RESULTS
First, we present our results on the development data, obtained after 7-fold cross validation. In addition to GPR, results from SVR under the same conditions are given in the following tables. In the SVR case, the parameter C was manually optimized using a grid search in the range [0.01, 100], and the kernel parameters were set to their default values (using the LIBSVM package) since they cannot be learned. Table 1 shows the results for the static emotion estimation in terms of the R² metric for both the MediaEval and UoA feature sets. The last row of each feature set type shows the best performing combination of GPR covariance kernels.

Table 1: R² results of SVR and GPR on development data.
  Algorithm  Kernel    Valence  Arousal
  MediaEval features
  SVR        Linear    0.112    0.300
             RBF       0.017    0.028
  GPR        LIN       0.132    0.565
             SE        0.142    0.590
             RQ        0.150    0.562
             MAT3      0.143    0.590
             LIN+RQ    0.170    0.581
  UoA features
  SVR        Linear    0.314    0.604
             RBF       0.367    0.653
  GPR        LIN       0.322    0.603
             SE        0.375    0.656
             RQ        0.430    0.662
             MAT3      0.395    0.668
             SExRQ     0.437    0.671

In Table 2, we summarize the results of the dynamic emotion estimation task, where the Kendall τ measure is calculated after pooling all arousal or valence estimates from all songs together. We have to mention that, since in this task the amount of data was 40 times bigger, we ran into some scalability problems with the GPR implementation and had to resort to approximations of the kernel matrix using much less data, which, of course, decreased the performance noticeably.

Table 2: Kendall τ results of SVR and GPR on development data using UoA features.
  Algorithm  Kernel    Valence  Arousal
  SVR        Linear    0.288    0.512
             RBF       0.346    0.530
  GPR        LIN       0.289    0.508
             SE        0.327    0.515
             RQ        0.339    0.521
             MAT3      0.333    0.519
             SE+MAT3   0.340    0.523

Table 3 presents the results of the UoA submission runs: two for the static and one for the dynamic emotion estimation task. They were obtained using GPR with the corresponding best performing kernels. Direct comparison with Tables 1 and 2 is possible only for the RSQ lines, and it can be seen that, in contrast to the MediaEval features, the UoA features give similar results.

Table 3: Official results on the test data obtained using GPR.
  Features (kernel)    Measure   Valence  Arousal
  Per song estimation
  MediaEval (LIN+RQ)   RSQ       -0.128   -0.408
                       MSE        0.026    0.043
                       MAE        0.134    0.172
                       SE-std     0.031    0.054
                       AE-std     0.094    0.116
  UoA (SExRQ)          RSQ        0.404    0.695
                       MSE        0.014    0.009
                       MAE        0.095    0.079
                       SE-std     0.020    0.013
                       AE-std     0.070    0.055
  Dynamic estimation
  UoA (SE+MAT3)        rho-avg    0.025    0.101
                       rho-std    0.020    0.216
                       MSE-avg    0.009    0.037
                       MSE-std    0.010    0.036
                       MAE-avg    0.076    0.152
                       MAE-std    0.044    0.078

5. CONCLUSIONS
We described the UoA emotion recognition system for the "Emotion in Music" task of the MediaEval'2013 benchmark evaluation, which is based on the Gaussian Process regression algorithm. Compared to Support Vector regression, GPR has several advantages, such as truly probabilistic prediction and the ability to learn hyperparameters from data. Performance-wise, GPR achieved better results for the static per song emotion estimation, but failed for the dynamic emotion estimation due to some scalability problems.

6. REFERENCES
[1] K. Markov and T. Matsui. Music genre classification using Gaussian process models. In Proc. IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2013.
[2] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, 2006.
[3] M. Soleymani, M. Caro, E. M. Schmidt, C. Sha, and Y. Yang. 1000 songs for emotional analysis of music. In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia (CrowdMM). ACM, 2013.
[4] Y.-H. Yang and H. Chen. Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology, 3(3):40:1-40:30, May 2012.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.