Music emotion recognition using Gaussian Processes

Konstantin Markov, Motofumi Iwata
Human Interface Laboratory, The University of Aizu, Fukushima, Japan
{markov,s1180127}@u-aizu.ac.jp

Tomoko Matsui
Department of Statistical Modeling, Institute of Statistical Mathematics, Tokyo, Japan
tmatsui@ism.ac.jp

ABSTRACT
This paper describes the music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval'2013 benchmark evaluation campaign. A set of standard feature types provided by the Marsyas toolkit was used to parametrize each music clip. Arousal and valence are modeled separately using Gaussian Process regression (GPR). We compared the performance of GPR and Support Vector regression (SVR) and found that GPR gives better results than SVR for the static per song emotion estimation task. For the dynamic emotion estimation task, GPR had some scalability problems and a fair comparison was not possible.

1. INTRODUCTION
Gaussian Processes (GPs) [2] are becoming increasingly popular in the Machine Learning community for their ability to learn highly non-linear mappings between two continuous data spaces, i.e. the feature space and the valence/arousal (V/A) space. Previously, we successfully applied GPs to a music genre classification task [1], which encouraged us to use GPs for music emotion estimation. Many previous studies [4] have focused on Support Vector regression (SVR), since in most cases it gives superior performance. In this study we compare GP regression with SVR and show that in certain cases GPR can significantly outperform SVR. In addition, GPR produces probabilistic predictions, i.e. it outputs a Gaussian distribution whose mean corresponds to the most probable target value and whose variance shows the certainty of the prediction. As in the case of SVR, GPR also uses kernels, but in contrast it allows the kernel parameters to be learned from the training data.
The database used in this evaluation is described in detail in the Emotion in Music task overview paper [3].

2. GAUSSIAN PROCESS REGRESSION
Given input training data vectors X = {xᵢ}, i = 1, ..., n, and their corresponding target values y = {yᵢ}, the general regression model relates them as yᵢ = f(xᵢ) + εᵢ, where εᵢ ∼ N(0, σₙ²) and f() is an unknown nonlinear function. In GP regression, it is assumed that this function is normally distributed, i.e. the vector f = [f(x₁), ..., f(xₙ)] has a Gaussian distribution f ∼ N(m, K), where K is a kernel covariance matrix and the mean m is often set to zero. This assumption allows the predictive distribution of a test target y∗ to be expressed in closed form only in terms of the training data and the input vector x∗: y∗ | x∗, y, X ∼ N(m∗, σ∗²), where m∗ = k∗ᵀ(K + σₙ²I)⁻¹y and σ∗² = k(x∗, x∗) − k∗ᵀ(K + σₙ²I)⁻¹k∗.
Covariance kernel parameters are learned by maximizing the marginal likelihood p(y|X, θ) = ∫ p(y|f) p(f|X, θ) df with respect to θ, which is known as the maximum likelihood type II approximation.
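The short NumPy sketch below is only an illustration of the two formulas above; the submitted system relied on an existing GPR toolkit, and the function names, the toy data, and the fixed hyperparameter values here are purely illustrative. It computes the predictive mean m∗ and variance σ∗² for a batch of test inputs, and evaluates the log marginal likelihood that would be maximized with respect to the kernel parameters θ.

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0, l=1.0):
    # Squared Exponential kernel: sigma^2 * exp(-(x - x')^T (x - x') / (2 l^2))
    d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return sigma**2 * np.exp(-d2 / (2.0 * l**2))

def gp_predict(X, y, X_star, kernel, sigma_n=0.1):
    # Closed-form GPR prediction:
    #   m*   = k*^T (K + sigma_n^2 I)^-1 y
    #   s*^2 = k(x*, x*) - k*^T (K + sigma_n^2 I)^-1 k*
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))  # noisy training covariance
    K_star = kernel(X, X_star)                      # cross-covariances k*
    K_ss = kernel(X_star, X_star)                   # test covariances
    L = np.linalg.cholesky(K)                       # stable way to apply (K + sigma_n^2 I)^-1
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_star)
    mean = K_star.T @ alpha                         # predictive means m*
    var = np.diag(K_ss) - np.sum(v**2, axis=0)      # predictive variances s*^2
    return mean, var

def log_marginal_likelihood(X, y, kernel, sigma_n=0.1):
    # log p(y | X, theta); maximizing this w.r.t. the kernel parameters
    # is the maximum likelihood type II approximation mentioned above.
    n = len(X)
    K = kernel(X, X) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

# Toy usage with synthetic 1-D data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3.0, 3.0, 5)[:, None]
mean, var = gp_predict(X, y, X_star, sq_exp_kernel)
print(mean, np.sqrt(var), log_marginal_likelihood(X, y, sq_exp_kernel))

Note that exact inference of this kind requires a Cholesky factorization of the n x n training covariance, i.e. O(n³) time, which is consistent with the scalability problems reported for the dynamic task below.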
3. SYSTEM DESCRIPTION
Dimensional music emotion recognition can be easily decomposed into two independent classical regression problems: one for the valence and another for the arousal. Thus, our system consists of two regression modules and a common feature extraction module.

3.1 Feature extraction
Features are extracted only from the audio signal, which is first downsampled to 22050 Hz. We tried various standard features tailored for music processing, such as MFCC, Statistical Spectrum Descriptors (SSD), Chroma, Spectral Crest Factor (SCF), and Spectral Flatness Measure (SFM), separately as well as in combinations of several of them. All feature vectors were calculated using the Marsyas toolkit with 512-sample frames and no overlap. For the dynamic emotion estimation task, first order statistics (mean and std) of the feature vectors are calculated over a window of about 1 sec, giving 45 vectors per music clip. For the static emotion estimation, the same statistics of these 45 vectors are calculated, resulting in a single high-dimensional feature vector per song.
After extensive preliminary experimentation, we found that the best performing combination of features for the per song emotion estimation is MFCC, SCF, and SFM. Adding SSD features did not have any noticeable effect, and Chroma features actually hurt the performance. We refer to this combination of features as UoA features. We also experimented with the features released by the benchmark organisers, which we call MediaEval features.

3.2 GPR implementation
Valence and arousal are modeled by separate GPRs. We used the standard Gaussian likelihood function, which allows exact inference to be performed. The GP mean was set to zero and only the type of covariance kernel was varied. We experimented with the following kernels:

• Linear (LIN): k(x, x′) = (xᵀx′ + 1)/l²
• Squared Exponential (SE): k(x, x′) = σ² exp(−(x − x′)ᵀ(x − x′)/(2l²))
• Rational Quadratic (RQ): k(x, x′) = σ²(1 + (x − x′)ᵀ(x − x′)/(2αl²))^(−α)
• Matérn 3 (MAT3): k(x, x′) = σ²(1 + r) exp(−r), r = √(3(x − x′)ᵀ(x − x′)/l²)

where σ and l are parameters learned from the training data. Sums or products of several kernels are also valid covariance functions and often give better performance than single kernels.
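As a small illustration of the kernels listed above, the NumPy sketch below shows how they, and sums or products of them (e.g. the SE+MAT3 and SExRQ combinations reported in the tables), could be written. The helper names and the default values of σ, l and α are illustrative only; in the actual system these hyperparameters are learned from the training data, and the submission did not use this code.

import numpy as np

def _sqdist(X1, X2):
    # Pairwise squared Euclidean distances (x - x')^T (x - x')
    return (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
            - 2.0 * X1 @ X2.T)

def lin(X1, X2, l=1.0):
    # Linear (LIN): (x^T x' + 1) / l^2
    return (X1 @ X2.T + 1.0) / l**2

def se(X1, X2, sigma=1.0, l=1.0):
    # Squared Exponential (SE): sigma^2 exp(-(x - x')^T (x - x') / (2 l^2))
    return sigma**2 * np.exp(-_sqdist(X1, X2) / (2.0 * l**2))

def rq(X1, X2, sigma=1.0, l=1.0, alpha=1.0):
    # Rational Quadratic (RQ): sigma^2 (1 + (x - x')^T (x - x') / (2 alpha l^2))^(-alpha)
    return sigma**2 * (1.0 + _sqdist(X1, X2) / (2.0 * alpha * l**2))**(-alpha)

def mat3(X1, X2, sigma=1.0, l=1.0):
    # Matern 3 (nu = 3/2): sigma^2 (1 + r) exp(-r), r = sqrt(3 (x - x')^T (x - x') / l^2)
    r = np.sqrt(np.maximum(3.0 * _sqdist(X1, X2), 0.0)) / l
    return sigma**2 * (1.0 + r) * np.exp(-r)

# Sums and products of valid kernels are again valid covariance functions.
def se_plus_mat3(X1, X2):
    return se(X1, X2) + mat3(X1, X2)

def se_times_rq(X1, X2):
    return se(X1, X2) * rq(X1, X2)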
4. RESULTS
First, we present our results on the development data, obtained after 7-fold cross validation. In addition to GPR, results from SVR under the same conditions are given in the following tables. In the SVR case, the parameter C was manually optimized using a grid search in the range [0.01, 100], and the kernel parameters were set to their default values (using the LIBSVM package) since they cannot be learned. Table 1 shows the results for the static emotion estimation in terms of the R² metric for both the MediaEval and UoA feature sets. The last row of each feature set type shows the best performing combination of GPR covariance kernels.

Table 1: R² results of SVR and GPR on development data.
  Algorithm  Kernel    Valence  Arousal
  MediaEval features
  SVR        Linear    0.112    0.300
             RBF       0.017    0.028
  GPR        LIN       0.132    0.565
             SE        0.142    0.590
             RQ        0.150    0.562
             MAT3      0.143    0.590
             LIN+RQ    0.170    0.581
  UoA features
  SVR        Linear    0.314    0.604
             RBF       0.367    0.653
  GPR        LIN       0.322    0.603
             SE        0.375    0.656
             RQ        0.430    0.662
             MAT3      0.395    0.668
             SExRQ     0.437    0.671

In Table 2, we summarize the results of the dynamic emotion estimation task, where the Kendall τ measure is calculated after pooling all arousal or valence estimates from all songs together. We have to mention that, since in this task the amount of data was 40 times bigger, we ran into some scalability problems with the GPR implementation and had to resort to approximations of the kernel matrix using much less data, which, of course, decreased the performance noticeably.

Table 2: Kendall τ results of SVR and GPR on development data using UoA features.
  Algorithm  Kernel    Valence  Arousal
  SVR        Linear    0.288    0.512
             RBF       0.346    0.530
  GPR        LIN       0.289    0.508
             SE        0.327    0.515
             RQ        0.339    0.521
             MAT3      0.333    0.519
             SE+MAT3   0.340    0.523

Table 3 presents the results of the UoA submission runs: two for the static and one for the dynamic emotion estimation task. They were obtained using GPR with the corresponding best performing kernels. Direct comparison with Tables 1 and 2 is possible only for the RSQ lines, and it can be seen that, in contrast to the MediaEval features, the UoA features give similar results.

Table 3: Official results on the test data obtained using GPR.
  Features (kernel)    Measure   Valence  Arousal
  Per song estimation
  MediaEval (LIN+RQ)   RSQ       -0.128   -0.408
                       MSE        0.026    0.043
                       MAE        0.134    0.172
                       SE-std     0.031    0.054
                       AE-std     0.094    0.116
  UoA (SExRQ)          RSQ        0.404    0.695
                       MSE        0.014    0.009
                       MAE        0.095    0.079
                       SE-std     0.020    0.013
                       AE-std     0.070    0.055
  Dynamic estimation
  UoA (SE+MAT3)        rho-avg    0.025    0.101
                       rho-std    0.020    0.216
                       MSE-avg    0.009    0.037
                       MSE-std    0.010    0.036
                       MAE-avg    0.076    0.152
                       MAE-std    0.044    0.078

5. CONCLUSIONS
We described the UoA emotion recognition system for the "Emotion in Music" task of the MediaEval'2013 benchmark evaluation, which is based on the Gaussian Process regression algorithm. Compared to Support Vector regression, GPR has several advantages, such as truly probabilistic prediction and the ability to learn hyperparameters from data. Performance-wise, GPR achieved better results for the static per song emotion estimation, but failed for the dynamic emotion estimation due to some scalability problems.

6. REFERENCES
[1] K. Markov and T. Matsui. Music genre classification using Gaussian process models. In Proc. IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2013.
[2] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, 2006.
[3] M. Soleymani, M. Caro, E. M. Schmidt, C. Sha, and Y. Yang. 1000 songs for emotional analysis of music. In Proceedings of the ACM Multimedia 2013 Workshop on Crowdsourcing for Multimedia (CrowdMM). ACM, 2013.
[4] Y.-H. Yang and H. Chen. Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology, 3(3):40:1-40:30, May 2012.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.