Mining Emotional Features of Movies

Yang Liu 1,2, Zhonglei Gu 3, Yu Zhang 4, Yan Liu 5
1 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
2 Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
3 AAOO Tech Limited, Hong Kong SAR, China
4 Department of CSE, Hong Kong University of Science and Technology, Hong Kong SAR, China
5 Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China
csygliu@comp.hkbu.edu.hk, allen.koo@aaoo-tech.com, zhangyu@cse.ust.hk, csyliu@comp.polyu.edu.hk

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
In this paper, we present an algorithm designed for mining the emotional features of movies. The algorithm, dubbed Arousal-Valence Discriminant Preserving Embedding (AV-DPE), is proposed to extract the intrinsic features embedded in movies that are discriminative along both the arousal and valence directions. After dimensionality reduction, we use a neural network and a support vector regressor to make the final prediction. Experimental results show that the extracted features capture most of the discriminant information in movie emotions.

1. INTRODUCTION
Affective multimedia content analysis aims to automatically recognize and analyze the emotions evoked by multimedia data such as images, music, and videos. It has many real-world applications, including image search, movie recommendation, and music classification [3, 7-9, 11-14].

In the 2016 Emotional Impact of Movies Task, participants are required to design algorithms that automatically predict the arousal and valence values of the given movies. The dataset used in this task is the LIRIS-ACCEDE dataset (liris-accede.ec-lyon.fr). It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution [2]. More details of the task requirements and the dataset description can be found in [5, 10].

In this paper, we perform both global and continuous emotion prediction via a proposed supervised dimensionality reduction algorithm called Arousal-Valence Discriminant Preserving Embedding (AV-DPE), which learns compact representations of the original data. After obtaining the low-dimensional features, we use a neural network and a support vector regressor to predict the emotion values.

2. PROPOSED METHOD
In order to derive the intrinsic factors in movies that convey or evoke emotions along the arousal and valence dimensions, we propose a supervised feature extraction algorithm dubbed Arousal-Valence Discriminant Preserving Embedding (AV-DPE) to map the original high-dimensional representations into a low-dimensional feature subspace, in which data with similar A-V values are close to each other, while data with different A-V values are far away from each other.

Let x ∈ R^D be the high-dimensional feature vector of a movie, and y = [y^(1), y^(2)] be the corresponding emotion label vector, where y^(1) and y^(2) denote the arousal value and the valence value, respectively. Given the training set {(x_1, y_1), ..., (x_n, y_n)}, AV-DPE aims at learning a transformation matrix U = [u_1, ..., u_d] ∈ R^{D×d} that projects the original D-dimensional data into an intrinsically low-dimensional subspace Z = R^d. In order to describe the similarity between data samples, we define the following adjacency scatter matrix:

    S_a = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} (x_i - x_j)(x_i - x_j)^T,    (1)

where A_{ij} denotes the similarity between the i-th and j-th data points. In our formulation, we use the inner product between the label vectors associated with x_i and x_j. To further normalize the similarity values into the interval [0, 1], we define the normalized adjacency matrix \hat{A}, where

    \hat{A}_{ij} = \langle \hat{y}_i, \hat{y}_j \rangle = \langle y_i / \|y_i\|, \; y_j / \|y_j\| \rangle.    (2)

The normalized adjacency scatter matrix is then defined as:

    \hat{S}_a = \sum_{i=1}^{n} \sum_{j=1}^{n} \hat{A}_{ij} (x_i - x_j)(x_i - x_j)^T.    (3)

Similarly, we define the normalized discriminant scatter matrix to characterize the dissimilarity between data points:

    \hat{S}_d = \sum_{i=1}^{n} \sum_{j=1}^{n} \hat{D}_{ij} (x_i - x_j)(x_i - x_j)^T,    (4)

where we simply define \hat{D}_{ij} = 1 - \hat{A}_{ij}.

In order to maximize the distance between data points with different labels while minimizing the distance between data points with similar labels, the objective function of AV-DPE is formulated as follows:

    U = \arg\max_{U} \; tr\big( (U^T \hat{S}_a U)^{\dagger} \, U^T \hat{S}_d U \big),    (5)

where tr(·) denotes the matrix trace and (\hat{S}_a)^{\dagger} denotes the Moore-Penrose pseudoinverse of \hat{S}_a [6]. The optimization problem in Eq. (5) can be solved by standard matrix decomposition techniques [6].
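For concreteness, the following is a minimal NumPy sketch of the AV-DPE formulation above. The function name fit_avdpe, the Laplacian-based computation of the scatter matrices, and the use of an eigendecomposition of pinv(\hat{S}_a) \hat{S}_d are illustrative choices on our part; the paper only states that Eq. (5) can be solved by standard matrix decomposition techniques.

```python
# A minimal sketch of AV-DPE as defined in Eqs. (1)-(5); names and the exact
# solver are assumptions, not prescribed by the paper.
import numpy as np

def fit_avdpe(X, Y, d):
    """X: (n, D) features; Y: (n, 2) arousal-valence labels; d: target dimension."""
    # Eq. (2): normalized adjacency A_hat_ij = <y_i/||y_i||, y_j/||y_j||>
    Y_hat = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    A_hat = Y_hat @ Y_hat.T
    D_hat = 1.0 - A_hat                        # dissimilarity weights used in Eq. (4)

    def weighted_scatter(W):
        # sum_ij W_ij (x_i - x_j)(x_i - x_j)^T = 2 X^T (diag(W 1) - W) X
        # for a symmetric weight matrix W (Eqs. (3) and (4)).
        L = np.diag(W.sum(axis=1)) - W
        return 2.0 * X.T @ L @ X

    S_a = weighted_scatter(A_hat)              # Eq. (3)
    S_d = weighted_scatter(D_hat)              # Eq. (4)

    # Eq. (5): take the leading eigenvectors of pinv(S_a) S_d as the columns of U.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_a) @ S_d)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:d]].real          # U in R^{D x d}

# Usage: U = fit_avdpe(X_train, Y_train, d); Z = X @ U gives the reduced features.
```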
3. EXPERIMENTS
In this section, we report the experimental settings and the evaluation results.

Global emotion prediction: we construct a 34-D feature set, including alpha, asymmetry_env, colorfulness, colorRawEnergy, colorStrength, compositionalBalance, cutLength, depthOfField, entropyComplexity, flatness, globalActivity, hueCount, lightning, maxSaliencyCount, medianLightness, minEnergy, nbFades, nbSceneCuts, nbWhiteFrames, saliencyDisparity, spatialEdgeDistributionArea, wtf_max2stdratio {1-12}, and zcr. Note that all of the above features are provided by the task organizers.

• Run #1: We use the original 34-D features as the input, and then use a function-fitting neural network [1] with 100 nodes in the hidden layer for prediction. The Levenberg-Marquardt backpropagation algorithm is used in training.

• Run #2: We use the original 34-D features as the input, and then use ν-support vector regression (ν-SVR) for prediction. In ν-SVR, the RBF kernel is used with the default settings from LIBSVM [4], i.e., cost = 1, ν = 0.5, and γ set to the reciprocal of the number of feature dimensions (a brief code sketch of this configuration is given after the run lists).

• Run #3: We first use the proposed AV-DPE to reduce the original feature space to a 10-D subspace. Then we use the neural network for prediction. The setting of the neural network is the same as that in Run #1.

• Run #4: We first use the proposed AV-DPE to reduce the original feature space to a 10-D subspace. Then we use the ν-SVR for prediction. The setting of ν-SVR is the same as that in Run #2.

Continuous emotion prediction: we downsample each video frame to 64 × 36 pixels. As a result, we obtain a 6912-D feature vector of RGB values for each frame.

• Run #1: We use the original 6912-D features as the input, and then use the neural network for prediction. The setting of the neural network is the same as that in Run #1 of the global emotion prediction.

• Run #2: We use the original 6912-D features as the input, and then use the ν-SVR for prediction. The setting of ν-SVR is the same as that in Run #2 of the global emotion prediction.

• Run #3: We first use the proposed AV-DPE to reduce the original high-dimensional feature space to a 100-D subspace. Then we use the neural network for prediction. The setting of the neural network is the same as that in Run #1 of the global emotion prediction.

• Run #4: We first use the proposed AV-DPE to reduce the original high-dimensional feature space to a 100-D subspace. Then we use the ν-SVR for prediction. The setting of ν-SVR is the same as that in Run #2 of the global emotion prediction.
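As referenced in Run #2 above, the following is a brief Python sketch of the ν-SVR configuration used in Runs #2 and #4. The experiments themselves use LIBSVM directly (and MATLAB's fitnet for the neural-network runs); scikit-learn's NuSVR wraps LIBSVM, and the helper name make_nu_svr and the one-regressor-per-emotion-dimension setup are our illustrative assumptions.

```python
# A hedged sketch of the nu-SVR setting quoted in the text, not the authors' code.
from sklearn.svm import NuSVR

def make_nu_svr():
    # LIBSVM defaults as stated in the paper: RBF kernel, cost = 1, nu = 0.5,
    # gamma = 1 / (number of feature dimensions); gamma='auto' gives exactly that.
    return NuSVR(kernel='rbf', C=1.0, nu=0.5, gamma='auto')

# Usage sketch: X_train is either the original features (Run #2) or the
# AV-DPE-reduced features (Run #4); arousal and valence are predicted separately.
# arousal_model = make_nu_svr().fit(X_train, y_train[:, 0])
# valence_model = make_nu_svr().fit(X_train, y_train[:, 1])
```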
Table 1: Results on global emotion prediction

    Run    Arousal MSE      Arousal Pearson's CC    Valence MSE       Valence Pearson's CC
    #1     1.18511707891    0.158772315634          0.235909661034    0.102487446458
    #2     1.18260763366    0.174547894742          0.378511708782    0.378511708782
    #3     1.46475414861    0.212414301359          0.267627565271    0.089311269390
    #4     1.61515123698    0.201427253365          0.239352667040    0.133965496755

Table 2: Results on continuous emotion prediction

    Run    Arousal MSE       Arousal Pearson's CC    Valence MSE       Valence Pearson's CC
    #1     0.152869437388    0.0500544335696         0.125062204735    0.00901181966468
    #2     0.128197164652    0.0557718765692         0.105905051008    0.0117374077757
    #3     0.125552338276    0.0266523947466         0.139507683129    0.00139093558922
    #4     0.293856466692    0.0266523946850         0.124565684871    0.0192993915142

Table 1 and Table 2 report the results of our system. From the tables we can see that, after dimensionality reduction, the performance of the reduced features (Run #3 and Run #4) is generally worse than that of the original features (Run #1 and Run #2), which indicates that the emotion information in movies is relatively complex and may not be fully described by just a few dimensions. However, considering that the dimension of the reduced features is much lower than that of the original features, we can still conclude that the learned subspace preserves rich discriminant information of the original feature space.

Moreover, from both tables we can observe that the neural network performs more robustly than SVR after dimensionality reduction. A possible reason is that, besides its discriminant ability, the neural network with a hidden layer represents the original data better than SVR, which is also of great importance in supervised learning tasks.

4. CONCLUSIONS
In this working notes paper, we have proposed a dimensionality reduction method to extract emotional features from movies. By simultaneously minimizing the distance between data points with similar emotion levels and maximizing the distance between data points with different emotion levels, the learned subspace keeps most of the discriminant information and gives relatively robust results in both the global and continuous emotion prediction tasks.

Acknowledgments
The authors would like to thank the reviewer for the helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61503317.
5. REFERENCES
[1] http://www.mathworks.com/help/nnet/ref/fitnet.html?requestedDomain=cn.mathworks.com.
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, Jan 2015.
[3] L. Canini, S. Benini, and R. Leonardi. Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Technology, 23(4):636–647, April 2013.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[5] E. Dellandréa, L. Chen, Y. Baveye, M. Sjoberg, and C. Chamaret. The MediaEval 2016 Emotional Impact of Movies Task. In MediaEval 2016 Workshop, 2016.
[6] G. H. Golub and C. F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.
[7] Y. Liu, Y. Liu, C. Wang, X. Wang, P. Zhou, G. Yu, and K. C. C. Chan. What strikes the strings of your heart? – Multi-label dimensionality reduction for music emotion analysis via brain imaging. IEEE Transactions on Autonomous Mental Development, 7(3):176–188, Sept 2015.
[8] Y. Liu, Y. Liu, Y. Zhao, and K. A. Hua. What strikes the strings of your heart? – Feature mining for music emotion analysis. IEEE Transactions on Affective Computing, 6(3):247–260, July 2015.
[9] R. R. Shah, Y. Yu, and R. Zimmermann. ADVISOR: Personalized video soundtrack recommendation by late fusion with heuristic rankings. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 607–616, 2014.
[10] M. Sjoberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, 2015.
[11] O. Sourina, Y. Liu, and M. K. Nguyen. Real-time EEG-based emotion recognition for music therapy. Journal on Multimodal User Interfaces, 5(1):27–35, 2012.
[12] X. Wang, J. Jia, J. Tang, B. Wu, L. Cai, and L. Xie. Modeling emotion influence in image social networks. IEEE Transactions on Affective Computing, 6(3):286–297, July 2015.
[13] K. Yadati, H. Katti, and M. Kankanhalli. CAVVA: Computational affective video-in-video advertising. IEEE Transactions on Multimedia, 16(1):15–23, Jan 2014.
[14] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia, 12(6):510–522, Oct 2010.