Affective Feature Extraction for Music Emotion Prediction

Yang Liu
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, P. R. China
csygliu@hkbu.edu.hk

Yan Liu
Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, P. R. China
csyliu@comp.polyu.edu.hk

Zhonglei Gu
AAOO Tech Limited, Shatin, Hong Kong SAR, P. R. China
allen.koo@aaoo-tech.com

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

In this paper, we describe the methods designed for extracting affective features from a given piece of music and predicting its dynamic emotion ratings along the arousal and valence dimensions. An algorithm called Arousal-Valence Similarity Preserving Embedding (AV-SPE) is presented to extract the intrinsic features embedded in the music signal that essentially evoke human emotions. A standard support vector regressor is then employed to predict the emotion ratings of the music along the arousal and valence dimensions. The experimental results demonstrate that the performance of the proposed method along the arousal dimension is significantly better than the baseline.

1. INTRODUCTION

The Emotion in Music task at the MediaEval 2015 Workshop aims to detect the emotional dynamics of music from its content. Specifically, given a set of songs, participants are asked to automatically generate continuous emotional representations in arousal and valence. More details of the task and the dataset can be found in [1].

Feature extraction, which aims to discover the intrinsic factors of the original data while capturing its essentials according to some criteria, plays an important role in music emotion analysis. Several algorithms have been proposed to learn the genuine correlates of the music signal that evoke emotional responses. You et al. presented a multi-label embedded feature selection (MEFS) method for music emotion classification [12]. Liu et al. introduced an algorithm called multi-emotion similarity preserving embedding (ME-SPE), which considers the correlation between different music emotions, and then analyzed the relationship between the low-dimensional features and the music emotions [6]. In this paper, we propose a feature extraction algorithm, arousal-valence similarity preserving embedding (AV-SPE), which inherits the basic idea of ME-SPE. The difference is that the emotion labels in ME-SPE are binary, i.e., 0 or 1, while those in AV-SPE can be any real number in [-1, 1].

To learn the relationship between the feature space and the dimensional emotion space, which is composed of the arousal dimension and the valence dimension, many popular machine learning approaches have been employed to train the model, such as k-Nearest Neighbor [11], Support Vector Regression [9], Boosting [7], Conditional Random Fields [2], and Gaussian Processes [8]. In this task, we employ nu-Support Vector Regression (nu-SVR) [10] to predict the arousal and valence labels of the music.

In the remainder of the paper, we first introduce the methods used for feature extraction and label prediction in Section 2. In Section 3, we report the evaluation results. Finally, we conclude the paper in Section 4.

2. METHOD

2.1 Feature Extraction via Arousal-Valence Similarity Preserving Embedding

To discover the intrinsic factors in music signals that convey or evoke emotions along the arousal and valence dimensions, we propose a supervised feature extraction algorithm dubbed AV-SPE, which maps the original high-dimensional representations into a low-dimensional feature subspace in which, we hope, a clearer linkage between the features and the emotions can be discovered.

Let x in R^D be the high-dimensional feature vector of the music at a certain time point (in our specific task, D = 260), and let y = [y^(1), y^(2)] be the corresponding emotion label vector, where y^(1) and y^(2) denote the arousal value and the valence value, respectively. The idea behind AV-SPE is simple: if two pieces of music convey similar emotions, they should possess some hidden features in common. Specifically, given the training set {(x_1, y_1), ..., (x_n, y_n)}, AV-SPE aims to learn a transformation matrix U = [u_1, ..., u_d] in R^{D x d} that projects the original D-dimensional data into an intrinsically low-dimensional subspace Z = R^d, in which data with similar emotion labels are close to each other.

The objective function of AV-SPE is formulated as follows:

    U = arg min_U J(U) = arg min_U sum_{i=1}^{n} sum_{j=1}^{n} ||U^T x_i - U^T x_j||^2 * S_ij,    (1)

where S_ij = <y^_i, y^_j> = <y_i/||y_i||, y_j/||y_j||> denotes the emotional similarity between x_i and x_j (i, j = 1, ..., n).

Following some standard operations in linear algebra, the above optimization problem can be reduced to a trace minimization problem:

    U = arg min_U tr(U^T X L X^T U),    (2)

where X = [x_1, x_2, ..., x_n] in R^{D x n} is the data matrix, L = D - S is the n x n Laplacian matrix [3] built from the similarity matrix S, D is the diagonal matrix defined as D_ii = sum_{j=1}^{n} S_ij (i = 1, ..., n), and tr(.) denotes the matrix trace operator.
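As a concrete illustration, the similarity matrix and Laplacian defined above can be assembled in a few lines. The sketch below is not the authors' implementation; it assumes NumPy arrays and nonzero label vectors.

```python
import numpy as np

def av_spe_laplacian(Y):
    """Build the emotional-similarity matrix S and the Laplacian L = D - S
    from arousal-valence labels.

    Y : (n, 2) array; row i is [arousal, valence], assumed nonzero.
    """
    # y^_i = y_i / ||y_i||, so S_ij = <y^_i, y^_j> is a cosine similarity
    Y_hat = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = Y_hat @ Y_hat.T
    # Degree matrix D_ii = sum_j S_ij and Laplacian L = D - S
    D = np.diag(S.sum(axis=1))
    return S, D, D - S
```

By construction L is symmetric with zero row sums; note that S may contain negative entries when two label vectors point in opposite directions of the arousal-valence plane.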
Obviously, L is positive semi-definite and D is positive definite. After transforming (1) into (2), the optimal solution can be obtained by standard eigendecomposition. Additionally, we introduce the constraint U^T X D X^T U = I_d to remove the scaling factor in the learning process, where I_d denotes the d-dimensional identity matrix. For the first transformation vector u_1, the problem becomes

    u_1 = arg min_{u_1} u_1^T X L X^T u_1,  subject to  u_1^T X D X^T u_1 = 1.    (3)

We then obtain the Lagrangian of (3):

    L(u_1, lambda) = u_1^T X L X^T u_1 - lambda (u_1^T X D X^T u_1 - 1).    (4)

Letting dL(u_1, lambda)/du_1 = 0, the optimal u_1 is the eigenvector corresponding to the smallest non-zero eigenvalue of the generalized eigendecomposition problem

    X L X^T u = lambda X D X^T u.    (5)

Similarly, u_2, ..., u_d are the eigenvectors corresponding to the 2nd, ..., d-th smallest non-zero eigenvalues of (5), respectively.

2.2 Music Emotion Prediction via Support Vector Regression

After feature extraction, we obtain the reduced features z_i = U^T x_i. We then use the reduced features as input to predict the emotion labels of the music via nu-Support Vector Regression (nu-SVR) [10]. Given the training set {(z_1, y_1), ..., (z_n, y_n)}, where z_i is the extracted feature vector and y_i = [y_i^(1), y_i^(2)] is the corresponding label vector containing the arousal and valence values, we train two regressors separately, one for the arousal values y^(1) and one for the valence values y^(2). The final optimization problem, i.e., the dual problem that nu-SVR solves, is:

    min_{alpha, alpha*}  (1/2) (alpha - alpha*)^T K (alpha - alpha*) + (y^(m))^T (alpha - alpha*)
    s.t.  e^T (alpha - alpha*) = 0,  e^T (alpha + alpha*) <= C nu,    (6)
          0 <= alpha_i, alpha*_i <= C/n,  i = 1, ..., n,

where alpha_i, alpha*_i are the Lagrange multipliers, K is an n x n positive semidefinite matrix in which K_ij = K(z_i, z_j) = phi(z_i)^T phi(z_j) is the kernel function, m = 1 or 2, e = [1, ..., 1]^T is the n-dimensional vector of all ones, and C > 0 is the regularization parameter. The predicted label of a new incoming vector z is:

    y = sum_{i=1}^{n} (alpha*_i - alpha_i) K(z_i, z) + b.    (7)

3. EVALUATION RESULTS

In this section, we report the experimental settings and the evaluation results. The features used in our experiments are extracted with the openSMILE toolbox [5]. The original dimension of the feature space, i.e., D, is 260. We set the reduced dimension to d = 10 for AV-SPE. The SVR implementation we use is LIBSVM [4]. In the training process, we use the radial basis function (RBF) kernel. Ten-fold cross-validation is employed to select the best parameters gamma and C; we finally select gamma = 0.125, C = 2, and nu = 0.5 for our model.

Table 1: Averaged RMSE of the Baseline and Proposed Methods

                                      Arousal            Valence
  Random Baseline (D = 260)           0.28 +/- 0.13      0.29 +/- 0.14
  Multilinear Regression (D = 260)    0.27 +/- 0.11      0.366 +/- 0.18
  nu-SVR (D = 260)                    0.2377 +/- 0.1089  0.3834 +/- 0.1943
  AV-SPE + nu-SVR (d = 10)            0.2414 +/- 0.1081  0.3689 +/- 0.1863

Table 2: Averaged Correlation of the Prediction and Ground Truth

                                      Arousal            Valence
  Multilinear Regression (D = 260)    0.36 +/- 0.26      0.01 +/- 0.38
  nu-SVR (D = 260)                    0.5610 +/- 0.2705  -0.0217 +/- 0.4494
  AV-SPE + nu-SVR (d = 10)            0.5806 +/- 0.2290  0.0133 +/- 0.4811

Table 1 and Table 2 list the averaged root-mean-square error (RMSE) and the averaged correlation, respectively. From the tables, we observe that the arousal results of nu-SVR and of AV-SPE + nu-SVR are significantly better than the baseline. Moreover, the results on the reduced feature space (d = 10), i.e., those of AV-SPE + nu-SVR, are comparable to the results on the original feature space (D = 260), which indicates that the extracted features play an important role in representing the music emotions.

4. CONCLUSIONS

In this working notes paper, we have introduced our system for predicting the emotional dynamics of music. The system is composed of a feature extraction algorithm and a support vector regressor. The evaluation results show that the features extracted by the proposed AV-SPE are informative, and that the system works well in predicting the arousal values. Our future work will focus on extending the proposed algorithm by considering the dynamic nature of the music data.
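The two-stage pipeline summarized above (AV-SPE projection via the generalized eigenproblem (5), followed by one nu-SVR per emotion dimension) can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' system: it uses random data in place of the openSMILE features, positive stand-in labels so that the degree matrix stays positive definite, a small ridge term for numerical stability, and scikit-learn's NuSVR as a stand-in for LIBSVM's nu-SVR.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
n, D, d = 200, 260, 10                   # frames, original dim, reduced dim
X = rng.standard_normal((D, n))          # stand-in for openSMILE features
# Positive stand-in labels; real arousal-valence labels live in [-1, 1].
Y = rng.uniform(0.05, 1.0, size=(n, 2))

# AV-SPE: cosine similarity of label vectors, Laplacian, eigenproblem (5)
Y_hat = Y / np.linalg.norm(Y, axis=1, keepdims=True)
S = Y_hat @ Y_hat.T
deg = S.sum(axis=1)
A = X @ (np.diag(deg) - S) @ X.T         # X L X^T
B = X @ np.diag(deg) @ X.T               # X D X^T
B += 1e-6 * np.trace(B) / D * np.eye(D)  # ridge: eigh needs B positive definite
w, V = eigh(A, B)                        # solves A u = lambda B u, ascending
U = V[:, :d]                             # d smallest eigenvalues -> projection
Z = X.T @ U                              # reduced features z_i = U^T x_i

# One nu-SVR per emotion dimension (RBF kernel; parameters from the paper)
models = [NuSVR(kernel="rbf", gamma=0.125, C=2.0, nu=0.5).fit(Z, Y[:, m])
          for m in range(2)]
pred = np.column_stack([m.predict(Z) for m in models])
print(pred.shape)  # one arousal and one valence prediction per frame
```

For brevity the sketch keeps the d smallest eigenvalues; the paper discards zero eigenvalues and starts from the smallest non-zero one.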
5. ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61373122.

6. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[2] T. Baltrušaitis, N. Banda, and P. Robinson. Dimensional affect recognition using continuous conditional random fields. In IEEE Conference on Automatic Face and Gesture Recognition, 2013.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
[5] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proc. 21st ACM International Conference on Multimedia, pages 835-838, 2013.
[6] Y. Liu, Y. Liu, Y. Zhao, and K. Hua. What strikes the strings of your heart? - Feature mining for music emotion analysis. IEEE Transactions on Affective Computing, PP(99):1-1, 2015.
[7] Q. Lu, X. Chen, D. Yang, and J. Wang. Boosting for multi-modal music emotion classification. In Proc. 11th ISMIR, pages 105-110, 2010.
[8] K. Markov and T. Matsui. Music genre and emotion recognition using Gaussian processes.
IEEE Access, 2:688-697, 2014.
[9] S. Rho, B.-J. Han, and E. Hwang. SVR-based music mood classification and context-based music recommendation. In Proc. 17th ACM Multimedia, pages 713-716, 2009.
[10] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207-1245, 2000.
[11] Y.-H. Yang, C.-C. Liu, and H. H. Chen. Music emotion classification: A fuzzy approach. In Proc. 14th ACM Multimedia, pages 81-84, 2006.
[12] M. You, J. Liu, G.-Z. Li, and Y. Chen. Embedded feature selection for multi-label classification of music emotions. International Journal of Computational Intelligence Systems, 5(4):668-678, 2012.