Affective Feature Extraction for Music Emotion Prediction

Yang Liu
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, P. R. China
csygliu@hkbu.edu.hk

Yan Liu
Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, P. R. China
csyliu@comp.polyu.edu.hk

Zhonglei Gu
AAOO Tech Limited, Shatin, Hong Kong SAR, P. R. China
allen.koo@aaoo-tech.com

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

In this paper, we describe the methods designed for extracting affective features from a given piece of music and predicting its dynamic emotion ratings along the arousal and valence dimensions. An algorithm called Arousal-Valence Similarity Preserving Embedding (AV-SPE) is presented to extract the intrinsic features embedded in the music signal that essentially evoke human emotions. A standard support vector regressor is then employed to predict the emotion ratings of the music along the arousal and valence dimensions. The experimental results demonstrate that the performance of the proposed method along the arousal dimension is significantly better than the baseline.

1. INTRODUCTION

The Emotion in Music task at the MediaEval 2015 Workshop aims to detect the emotional dynamics of music from its content. Specifically, given a set of songs, participants are asked to automatically generate continuous emotional representations in arousal and valence. More details of the task and the dataset can be found in [1].

Feature extraction, which aims to discover the intrinsic factors of the original data while capturing its essentials according to some criteria, plays an important role in music emotion analysis. Several algorithms have been proposed to learn the genuine correlates of the music signal that evoke emotional responses. You et al. presented a multi-label embedded feature selection (MEFS) method for music emotion classification [12]. Liu et al. introduced an algorithm called multi-emotion similarity preserving embedding (ME-SPE), which considers the correlation between different music emotions, and then analyzed the relationship between the low-dimensional features and the music emotions [6]. In this paper, we propose a feature extraction algorithm, arousal-valence similarity preserving embedding (AV-SPE), which inherits the basic idea of ME-SPE. The difference is that the emotion labels in ME-SPE are binary, i.e., 0 or 1, while those in AV-SPE can be any real number in [-1, 1].

To learn the relationship between the feature space and the dimensional emotion space, which is composed of the arousal dimension and the valence dimension, many popular machine learning approaches have been employed to train the model, such as k-Nearest Neighbor [11], Support Vector Regression [9], Boosting [7], Conditional Random Fields [2], and Gaussian Processes [8]. In this task, we employ nu-Support Vector Regression (nu-SVR) [10] to predict the arousal and valence labels of the music.

In the remainder of the paper, we first introduce the methods used for feature extraction and label prediction in Section 2. In Section 3, we report the evaluation results. Finally, we conclude the paper in Section 4.

2. METHOD

2.1 Feature Extraction via Arousal-Valence Similarity Preserving Embedding

To discover the intrinsic factors in music signals that convey or evoke emotions along the arousal and valence dimensions, we propose a supervised feature extraction algorithm dubbed AV-SPE, which maps the original high-dimensional representations into a low-dimensional feature subspace in which, we hope, a clearer linkage between the features and the emotions can be discovered.

Let x in R^D be the high-dimensional feature vector of the music at a certain time point (in our specific task, D = 260), and let y = [y^(1), y^(2)] be the corresponding emotion label vector, where y^(1) and y^(2) denote the arousal value and the valence value, respectively. The idea behind AV-SPE is simple: if two pieces of music convey similar emotions, they should possess some hidden features in common. Specifically, given the training set {(x_1, y_1), ..., (x_n, y_n)}, AV-SPE aims to learn a transformation matrix U = [u_1, ..., u_d] in R^{D x d} that projects the original D-dimensional data into an intrinsically low-dimensional subspace Z = R^d, in which data with similar emotion labels are close to each other.

The objective function of AV-SPE is formulated as follows:

    U = arg min_U J(U) = arg min_U sum_{i=1}^{n} sum_{j=1}^{n} ||U^T x_i - U^T x_j||^2 * S_ij,    (1)

where S_ij = <y^_i, y^_j> = <y_i/||y_i||, y_j/||y_j||> denotes the emotional similarity between x_i and x_j (i, j = 1, ..., n).

Following some standard operations in linear algebra, the above optimization problem can be reduced to a trace minimization problem:

    U = arg min_U tr(U^T X L X^T U),    (2)

where X = [x_1, x_2, ..., x_n] in R^{D x n} is the data matrix, L = D - S is the n x n Laplacian matrix [3] built from the similarity matrix S, D is the diagonal matrix defined as D_ii = sum_{j=1}^{n} S_ij (i = 1, ..., n), and tr(.) denotes the matrix trace operator.
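As a concrete illustration, the similarity matrix and Laplacian defined above can be assembled in a few lines. The sketch below is not the authors' implementation; it assumes NumPy arrays and nonzero label vectors.

```python
import numpy as np

def av_spe_laplacian(Y):
    """Build the emotional-similarity matrix S and the Laplacian L = D - S
    from arousal-valence labels.

    Y : (n, 2) array; row i is [arousal, valence], assumed nonzero.
    """
    # y^_i = y_i / ||y_i||, so S_ij = <y^_i, y^_j> is a cosine similarity
    Y_hat = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = Y_hat @ Y_hat.T
    # Degree matrix D_ii = sum_j S_ij and Laplacian L = D - S
    D = np.diag(S.sum(axis=1))
    return S, D, D - S
```

By construction L is symmetric with zero row sums; note that S may contain negative entries when two label vectors point in opposite directions of the arousal-valence plane.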
Obviously, L is positive semi-definite and D is positive definite. After transforming (1) into (2), the optimal solution can be obtained by standard eigendecomposition. Additionally, we introduce the constraint U^T X D X^T U = I_d to remove the scaling factor in the learning process, where I_d denotes the d-dimensional identity matrix. For the first transformation vector u_1, the problem becomes

    u_1 = arg min_{u_1} u_1^T X L X^T u_1,  subject to  u_1^T X D X^T u_1 = 1.    (3)

We then obtain the Lagrangian of (3):

    L(u_1, lambda) = u_1^T X L X^T u_1 - lambda (u_1^T X D X^T u_1 - 1).    (4)

Letting dL(u_1, lambda)/du_1 = 0, the optimal u_1 is the eigenvector corresponding to the smallest non-zero eigenvalue of the generalized eigendecomposition problem

    X L X^T u = lambda X D X^T u.    (5)

Similarly, u_2, ..., u_d are the eigenvectors corresponding to the 2nd, ..., d-th smallest non-zero eigenvalues of (5), respectively.

2.2 Music Emotion Prediction via Support Vector Regression

After feature extraction, we obtain the reduced features z_i = U^T x_i. We then use the reduced features as input to predict the emotion labels of the music via nu-Support Vector Regression (nu-SVR) [10]. Given the training set {(z_1, y_1), ..., (z_n, y_n)}, where z_i is the extracted feature vector and y_i = [y_i^(1), y_i^(2)] is the corresponding label vector containing the arousal and valence values, we train two regressors separately, one for the arousal values y^(1) and one for the valence values y^(2). The final optimization problem, i.e., the dual problem that nu-SVR solves, is:

    min_{alpha, alpha*}  (1/2) (alpha - alpha*)^T K (alpha - alpha*) + (y^(m))^T (alpha - alpha*)
    s.t.  e^T (alpha - alpha*) = 0,  e^T (alpha + alpha*) <= C nu,    (6)
          0 <= alpha_i, alpha*_i <= C/n,  i = 1, ..., n,

where alpha_i, alpha*_i are the Lagrange multipliers, K is an n x n positive semidefinite matrix in which K_ij = K(z_i, z_j) = phi(z_i)^T phi(z_j) is the kernel function, m = 1 or 2, e = [1, ..., 1]^T is the n-dimensional vector of all ones, and C > 0 is the regularization parameter. The predicted label of a new incoming vector z is:

    y = sum_{i=1}^{n} (alpha*_i - alpha_i) K(z_i, z) + b.    (7)

3. EVALUATION RESULTS

In this section, we report the experimental settings and the evaluation results. The features used in our experiments are extracted with the openSMILE toolbox [5]. The original dimension of the feature space, i.e., D, is 260. We set the reduced dimension to d = 10 for AV-SPE. The SVR implementation we use is LIBSVM [4]. In the training process, we use the radial basis function (RBF) kernel. Ten-fold cross-validation is employed to select the best parameters gamma and C; we finally select gamma = 0.125, C = 2, and nu = 0.5 for our model.

Table 1: Averaged RMSE of the Baseline and Proposed Methods

                                      Arousal            Valence
  Random Baseline (D = 260)           0.28 +/- 0.13      0.29 +/- 0.14
  Multilinear Regression (D = 260)    0.27 +/- 0.11      0.366 +/- 0.18
  nu-SVR (D = 260)                    0.2377 +/- 0.1089  0.3834 +/- 0.1943
  AV-SPE + nu-SVR (d = 10)            0.2414 +/- 0.1081  0.3689 +/- 0.1863

Table 2: Averaged Correlation of the Prediction and Ground Truth

                                      Arousal            Valence
  Multilinear Regression (D = 260)    0.36 +/- 0.26      0.01 +/- 0.38
  nu-SVR (D = 260)                    0.5610 +/- 0.2705  -0.0217 +/- 0.4494
  AV-SPE + nu-SVR (d = 10)            0.5806 +/- 0.2290  0.0133 +/- 0.4811

Table 1 and Table 2 list the averaged root-mean-square error (RMSE) and the averaged correlation, respectively. From the tables, we observe that the arousal results of nu-SVR and of AV-SPE + nu-SVR are significantly better than the baseline. Moreover, the results on the reduced feature space (d = 10), i.e., those of AV-SPE + nu-SVR, are comparable to the results on the original feature space (D = 260), which indicates that the extracted features play an important role in representing the music emotions.

4. CONCLUSIONS

In this working notes paper, we have introduced our system for predicting the emotional dynamics of music. The system is composed of a feature extraction algorithm and a support vector regressor. The evaluation results show that the features extracted by the proposed AV-SPE are informative, and that the system works well in predicting the arousal values. Our future work will focus on extending the proposed algorithm by considering the dynamic nature of the music data.
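The two-stage pipeline summarized above (AV-SPE projection via the generalized eigenproblem (5), followed by one nu-SVR per emotion dimension) can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' system: it uses random data in place of the openSMILE features, positive stand-in labels so that the degree matrix stays positive definite, a small ridge term for numerical stability, and scikit-learn's NuSVR as a stand-in for LIBSVM's nu-SVR.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
n, D, d = 200, 260, 10                   # frames, original dim, reduced dim
X = rng.standard_normal((D, n))          # stand-in for openSMILE features
# Positive stand-in labels; real arousal-valence labels live in [-1, 1].
Y = rng.uniform(0.05, 1.0, size=(n, 2))

# AV-SPE: cosine similarity of label vectors, Laplacian, eigenproblem (5)
Y_hat = Y / np.linalg.norm(Y, axis=1, keepdims=True)
S = Y_hat @ Y_hat.T
deg = S.sum(axis=1)
A = X @ (np.diag(deg) - S) @ X.T         # X L X^T
B = X @ np.diag(deg) @ X.T               # X D X^T
B += 1e-6 * np.trace(B) / D * np.eye(D)  # ridge: eigh needs B positive definite
w, V = eigh(A, B)                        # solves A u = lambda B u, ascending
U = V[:, :d]                             # d smallest eigenvalues -> projection
Z = X.T @ U                              # reduced features z_i = U^T x_i

# One nu-SVR per emotion dimension (RBF kernel; parameters from the paper)
models = [NuSVR(kernel="rbf", gamma=0.125, C=2.0, nu=0.5).fit(Z, Y[:, m])
          for m in range(2)]
pred = np.column_stack([m.predict(Z) for m in models])
print(pred.shape)  # one arousal and one valence prediction per frame
```

For brevity the sketch keeps the d smallest eigenvalues; the paper discards zero eigenvalues and starts from the smallest non-zero one.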
5. ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61373122.

6. REFERENCES

[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in Music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, September 2015.
[2] T. Baltrušaitis, N. Banda, and P. Robinson. Dimensional affect recognition using continuous conditional random fields. In IEEE Conference on Automatic Face and Gesture Recognition, 2013.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
[5] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proc. 21st ACM International Conference on Multimedia, pages 835-838, 2013.
[6] Y. Liu, Y. Liu, Y. Zhao, and K. Hua. What strikes the strings of your heart? - Feature mining for music emotion analysis. IEEE Transactions on Affective Computing, PP(99):1-1, 2015.
[7] Q. Lu, X. Chen, D. Yang, and J. Wang. Boosting for multi-modal music emotion classification. In Proc. 11th ISMIR, pages 105-110, 2010.
[8] K. Markov and T. Matsui. Music genre and emotion recognition using Gaussian processes.
IEEE Access, 2:688-697, 2014.
[9] S. Rho, B.-J. Han, and E. Hwang. SVR-based music mood classification and context-based music recommendation. In Proc. 17th ACM Multimedia, pages 713-716, 2009.
[10] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207-1245, 2000.
[11] Y.-H. Yang, C.-C. Liu, and H. H. Chen. Music emotion classification: A fuzzy approach. In Proc. 14th ACM Multimedia, pages 81-84, 2006.
[12] M. You, J. Liu, G.-Z. Li, and Y. Chen. Embedded feature selection for multi-label classification of music emotions. International Journal of Computational Intelligence Systems, 5(4):668-678, 2012.