
Mining Emotional Features of Movies

Yang Liu

2 4

Zhonglei Gu

Yu Zhang

zhangyu@cse.ust.hk 1

Yan Liu

3

0 AAOO Tech Limited, Hong Kong SAR, China
1 Department of CSE, Hong Kong University of Science and Technology, Hong Kong SAR, China
2 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
3 Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China
4 Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China

2016

20 21

In this paper, we present an algorithm designed for mining emotional features of movies. The algorithm, dubbed Arousal-Valence Discriminant Preserving Embedding (AV-DPE), extracts the intrinsic features embedded in movies that are discriminative along both the arousal and valence directions. After dimensionality reduction, we use a neural network and a support vector regressor to make the final prediction. Experimental results show that the extracted features can capture most of the discriminant information in movie emotions.

1. INTRODUCTION

Affective multimedia content analysis aims to automatically recognize and analyze the emotions evoked by multimedia data such as images, music, and videos. It has many real-world applications, such as image search, movie recommendation, and music classification [3, 7–9, 11–14].

In the MediaEval 2016 Emotional Impact of Movies Task, participants are required to design algorithms that automatically predict the arousal and valence values of the given movies. The dataset used in this task is the LIRIS-ACCEDE dataset (liris-accede.ec-lyon.fr). It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution [2]. More details of the task requirements and the dataset can be found in [5, 10].

In this paper, we perform both global and continuous emotion prediction via a proposed supervised dimensionality reduction algorithm called Arousal-Valence Discriminant Preserving Embedding (AV-DPE), which learns compact representations of the original data. After obtaining the low-dimensional features, we use a neural network and a support vector regressor to predict the emotion values.

2. PROPOSED METHOD

In order to derive the intrinsic factors in movies that convey or evoke emotions along the arousal and valence dimensions, we propose a supervised feature extraction algorithm dubbed Arousal-Valence Discriminant Preserving Embedding (AV-DPE). It maps the original high-dimensional representations into a low-dimensional feature subspace, in which data with similar A-V values are close to each other, while data with different A-V values are far away from each other.

Let $x \in \mathbb{R}^{D}$ be the high-dimensional feature vector of a movie, and $y = [y^{(1)}, y^{(2)}]$ be the corresponding emotion label vector, where $y^{(1)}$ and $y^{(2)}$ denote the arousal value and valence value, respectively. Given the training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, AV-DPE aims at learning a transformation matrix $U = [u_1, \ldots, u_d] \in \mathbb{R}^{D \times d}$ which projects the original $D$-dimensional data to an intrinsically low-dimensional subspace $Z = \mathbb{R}^{d}$.

In order to describe the similarity between data samples, we define the following adjacency scatter matrix:

$$S_a = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} (x_i - x_j)(x_i - x_j)^T, \qquad (1)$$

where $A_{ij}$ denotes the similarity between the $i$-th and $j$-th data points. In our formulation, we use the inner product between the label vectors associated with $x_i$ and $x_j$. To further normalize the similarity values into the interval $[0, 1]$, we define the normalized adjacency matrix $\hat{A}$, where

$$\hat{A}_{ij} = \langle \hat{y}_i, \hat{y}_j \rangle = \langle y_i / \|y_i\|, \; y_j / \|y_j\| \rangle. \qquad (2)$$

The normalized adjacency scatter matrix is then defined as:

$$\hat{S}_a = \sum_{i=1}^{n} \sum_{j=1}^{n} \hat{A}_{ij} (x_i - x_j)(x_i - x_j)^T. \qquad (3)$$

Similarly, we define the normalized discriminant scatter matrix to characterize the dissimilarity between data points:

$$\hat{S}_d = \sum_{i=1}^{n} \sum_{j=1}^{n} \hat{D}_{ij} (x_i - x_j)(x_i - x_j)^T, \qquad (4)$$

where we simply define $\hat{D}_{ij} = 1 - \hat{A}_{ij}$.
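To make the two normalized scatter matrices concrete, the following is a minimal NumPy sketch of how they could be computed. The function name `normalized_scatter_matrices` and the row-wise data layout (one sample per row) are assumptions made for illustration, not part of the original system. The sketch also assumes that every label vector is nonzero, and it relies on the graph-Laplacian identity to avoid an explicit double loop.

```python
import numpy as np

def normalized_scatter_matrices(X, Y):
    """Return (S_a_hat, S_d_hat) for features X (n x D) and labels Y (n x 2).

    Assumes every label vector is nonzero; nonnegative labels keep the
    cosine similarities inside [0, 1].
    """
    Y_hat = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # y_i / ||y_i||
    A_hat = Y_hat @ Y_hat.T   # normalized adjacency: cosine similarity of labels
    D_hat = 1.0 - A_hat       # dissimilarity weights

    def weighted_scatter(W):
        # For symmetric W:
        # sum_ij W_ij (x_i - x_j)(x_i - x_j)^T = 2 X^T (diag(W 1) - W) X
        laplacian = np.diag(W.sum(axis=1)) - W
        return 2.0 * X.T @ laplacian @ X

    return weighted_scatter(A_hat), weighted_scatter(D_hat)
```

Both returned matrices are D × D, so their size depends on the feature dimension rather than on the number of training samples.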

In order to maximize the distance between data points with different labels while minimizing the distance between data points with similar labels, the objective function of AV-DPE is formulated as follows:

$$U = \arg\max_{U} \, \mathrm{tr}\left( (U^T \hat{S}_a U)^{\dagger} \, U^T \hat{S}_d U \right), \qquad (5)$$

where $\mathrm{tr}(\cdot)$ denotes the matrix trace operation and $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudoinverse [6]. The optimization problem in Eq. (5) can be solved by standard matrix decomposition techniques [6].
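The text does not spell out which decomposition is used. One common way to handle a ratio-trace objective of this form is to take the leading eigenvectors of the pseudoinverse of the adjacency scatter matrix multiplied by the discriminant scatter matrix. The sketch below follows that route as an assumption; the function name `av_dpe_projection` is hypothetical and this is not necessarily the authors' exact procedure.

```python
import numpy as np

def av_dpe_projection(S_a, S_d, d):
    """Sketch: columns of U (D x d) are the top-d eigenvectors of pinv(S_a) @ S_d.

    Assumption: the ratio-trace objective is solved through this
    eigen-decomposition; the paper only refers to "standard matrix
    decomposition techniques".
    """
    M = np.linalg.pinv(S_a) @ S_d
    eigvals, eigvecs = np.linalg.eig(M)
    # M is not symmetric in general, so eig may return complex values with
    # negligible imaginary parts; keep the real parts.
    order = np.argsort(-eigvals.real)[:d]
    return eigvecs[:, order].real
```

With training features stored row-wise in a matrix `X` (n × D), the low-dimensional representation would then be `Z = X @ U`, giving one d-dimensional row per sample.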

3. EXPERIMENTS

In this section, we report the experimental settings and the evaluation results.

Global emotion prediction: we construct a 34-D feature set, including alpha, asymmetry_env, colorfulness, colorRawEnergy, colorStrength, compositionalBalance, cutLength, depthOfField, entropyComplexity, flatness, globalActivity, hueCount, lightning, maxSaliencyCount, medianLightness, minEnergy, nbFades, nbSceneCuts, nbWhiteFrames, saliencyDisparity, spatialEdgeDistributionArea, wtf_max2stdratio_{1-12}, and zcr. All of these features are provided by the task organizers.

Run #1: We use the original 34-D features as the input, and then use a function fitting neural network [1] with 100 nodes in the hidden layer for prediction. The network is trained with Levenberg-Marquardt backpropagation.

Run #2: We use the original 34-D features as input, and then use ν-support vector regression (ν-SVR) for prediction. In ν-SVR, the RBF kernel is used with the default settings from LIBSVM [4], i.e., cost = 1, ν = 0.5, and γ set to the reciprocal of the number of feature dimensions.
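As a small illustration of the kernel setting, and assuming the regressor is ν-SVR with LIBSVM's default kernel width (the reciprocal of the number of feature dimensions), the sketch below computes an RBF kernel matrix with that width. The function name `rbf_kernel_default_gamma` is hypothetical; the code shows only the kernel, not the regressor itself.

```python
import numpy as np

def rbf_kernel_default_gamma(X1, X2):
    """RBF kernel exp(-gamma * ||a - b||^2) with gamma = 1 / n_features."""
    gamma = 1.0 / X1.shape[1]
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```

Under this setting the kernel width is 1/34 for the original 34-D features and 1/10 after reduction to a 10-D subspace, so the same "default" configuration corresponds to different kernel widths across runs.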

Run #3: We first use the proposed AV-DPE to reduce the original feature space to a 10-D subspace. We then use the neural network for prediction, with the same settings as in Run #1.

Run #4: We first use the proposed AV-DPE to reduce the original feature space to a 10-D subspace. We then use ν-SVR for prediction, with the same settings as in Run #2.

Continuous emotion prediction: we downsample each video to a frame size of 64 × 36. As a result, we have a 6912-D feature vector of RGB values (64 × 36 × 3) for each frame.
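The per-frame feature construction can be sketched as follows. The resampling method is not stated in the text, so block averaging is used here purely as an assumption, and the function name `frame_to_feature` is hypothetical. The sketch further assumes that the input frame height and width are integer multiples of 36 and 64.

```python
import numpy as np

def frame_to_feature(frame, out_w=64, out_h=36):
    """Block-average an (H, W, 3) RGB frame to out_h x out_w and flatten it.

    Assumes H and W are integer multiples of out_h and out_w.
    """
    h, w, c = frame.shape
    bh, bw = h // out_h, w // out_w
    blocks = frame[: bh * out_h, : bw * out_w].reshape(out_h, bh, out_w, bw, c)
    small = blocks.mean(axis=(1, 3))  # shape (out_h, out_w, 3)
    return small.reshape(-1)          # 64 * 36 * 3 = 6912 values
```

For example, a 640 × 360 RGB frame is averaged over 10 × 10 pixel blocks and flattened into a vector of length 6912.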

Run #1: We use the original 6912-D features as the input, and then use the neural network for prediction. The settings of the neural network are the same as in Run #1 of global emotion prediction.

Run #2: We use the original 6912-D features as the input, and then use ν-SVR for prediction. The settings of ν-SVR are the same as in Run #2 of global emotion prediction.

Run #3: We first use the proposed AV-DPE to reduce the original high-dimensional feature space to a 100-D subspace. We then use the neural network for prediction, with the same settings as in Run #1 of global emotion prediction.

Run #4: We first use the proposed AV-DPE to reduce the original high-dimensional feature space to a 100-D subspace. We then use ν-SVR for prediction, with the same settings as in Run #2 of global emotion prediction.

Table 1 and Table 2 report the results of our system. The tables show that, after dimensionality reduction, the performance of the reduced features (Run #3 and Run #4) is generally worse than that of the original features (Run #1 and Run #2). This indicates that the emotion information in movies is relatively complex, so we may not be able to fully describe it using just a few dimensions. However, considering that the dimension of the reduced features is much lower than that of the original features, we can still conclude that the learned subspace preserves rich discriminant information of the original feature space.

Moreover, both tables show that the neural network is more robust than SVR after dimensionality reduction. A possible reason is that, besides its discriminant ability, the neural network with a hidden layer has a better ability than SVR to represent the original data, which is also of great importance in supervised learning tasks.

4. CONCLUSIONS

In this working notes paper, we have proposed a dimensionality reduction method to extract the emotional features from movies. By simultaneously minimizing the distance between data points with similar emotion levels and maximizing the distance between data points with different emotion levels, the learned subspace keeps most of the discriminant information and gives relatively robust results in both global and continuous emotion prediction tasks.

Acknowledgments

The authors would like to thank the reviewer for the helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61503317.

[1] http://www.mathworks.com/help/nnet/ref/fitnet.html?requestedDomain=cn.mathworks.com.

[2] Baveye, Dellandrea, Chamaret, and Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, Jan 2015.

[3] Canini, Benini, and Leonardi. Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Technology, 23(4):636–647, April 2013.

[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[5] Dellandrea, Chen, Baveye, Sjoberg, and Chamaret. The MediaEval 2016 emotional impact of movies task. In MediaEval 2016 Workshop, 2016.

[6] G. H. Golub and C. F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.

[7] Liu, Wang, Zhou, Yu, and K. C. C. Chan. What strikes the strings of your heart? – multi-label dimensionality reduction for music emotion analysis via brain imaging. IEEE Transactions on Autonomous Mental Development, 7(3):176–188, Sept 2015.

[8] Liu, Zhao, and K. A. Hua. What strikes the strings of your heart? – feature mining for music emotion analysis. IEEE Transactions on Affective Computing, 6(3):247–260, July 2015.

[9] R. R. Shah, Yu, and Zimmermann. Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 607–616, 2014.

[10] Sjoberg, Baveye, Wang, V. L. Quang, Ionescu, E. Dellandrea, Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 affective impact of movies task. In MediaEval 2015 Workshop, 2015.

[11] Sourina, Liu, and M. K. Nguyen. Real-time EEG-based emotion recognition for music therapy. Journal on Multimodal User Interfaces, 5(1):27–35, 2012.

[12] Wang, Jia, Tang, Wu, Cai, and Xie. Modeling emotion influence in image social networks. IEEE Transactions on Affective Computing, 6(3):286–297, July 2015.

[13] Yadati, Katti, and Kankanhalli. CAVVA: Computational affective video-in-video advertising. IEEE Transactions on Multimedia, 16(1):15–23, Jan 2014.

[14] Zhang, Huang, Jiang, Gao, and Tian. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia, 12(6):510–522, Oct 2010.