BUL in MediaEval 2016 Emotional Impact of Movies Task

Asim Jan, Yona Falinie A. Gaus, Fan Zhang, Hongying Meng
Department of Electronic and Computer Engineering, Brunel University, London
{asim.jan, yonafalinie.abdgaus, fan.zhang, hongying.meng}@brunel.ac.uk

ABSTRACT
This paper describes our working approach for the Emotional Impact of Movies task of MediaEval 2016. The task comprises two sub-tasks that require affective predictions, in terms of Arousal and Valence values, on video clips. Sub-task 1 requires global emotion prediction; for it, a framework is developed using Deep Auto-Encoders, a feature variation algorithm and a Deep Network. For Sub-task 2, a set of audio features is extracted for continuous emotion prediction. Both sub-tasks are approached as a regression problem evaluated by Mean Squared Error and Pearson Correlation Coefficient.

1. INTRODUCTION
The 'Emotional Impact of Movies Task' comprises two sub-tasks with the goal of creating a system that automatically predicts the emotional impact of video content in terms of Arousal and Valence, which together form a 2-D scale for describing emotions. Sub-task 1, Global emotion prediction, asks for a score of induced Valence (negative-positive) and induced Arousal (calm-excited) for a whole clip; Sub-task 2, Continuous emotion prediction, asks for a score of induced Arousal and Valence for each 1s segment of a video. The development dataset used in both sub-tasks is the LIRIS-ACCEDE dataset [2]. For the first sub-task, 9800 video excerpts (around 10s each) are provided with global Valence and Arousal annotations. For the second sub-task, 30 movies are provided with continuous annotations of Valence and Arousal. Full details on the challenge tasks and database can be found in [3].

2. METHODOLOGY

2.1 Framework Summary for Sub-task 1:
The framework is primarily based on visual cues, using Deep Learning to benefit from the large number of video samples. The videos contain many different scenes, which makes the emotion detection process challenging. To tackle this, a three-stage framework has been designed. The first stage is a Deep Auto-Encoder (see [1] for background). It is trained on all video samples, reproducing each image at frame level, so as to learn representations that are common across the videos; this is likely to highlight people's faces and suppress uncommon scenes and objects. The second stage observes the decoded features for variations within a video sample by applying the Feature Dynamic History Histogram (FDHH) across the frame level, producing a histogram of patterns that summarizes these observations over a set of features. Finally, the FDHH features are used with a regression model to predict the Arousal and Valence scales.

2.1.1 Stage 1 - Auto-Encoder:
Two Deep Auto-Encoders are trained, one with an MSE loss and the other with a Euclidean loss. Using an architecture similar to Fig. 1 of [10], both Auto-Encoders consist of 4 convolution (Conv) layers followed by 4 deconvolution (DeConv) layers. Each Conv and DeConv layer is followed by a Rectified Linear Unit (ReLU) activation layer, and a loss layer is placed at the end.

2.1.2 Stage 2 - FDHH Feature:
The FDHH algorithm, based on the idea of the Motion History Histogram (MHH) [8], aims to extract temporal movement across the feature space. First, the absolute difference is taken between the feature vector V(n, 1:C) representing a frame and that of the following frame, V(n+1, 1:C), producing D(n, 1:C), where n is the frame index and C is the feature dimension. Next, each dimension of D is compared to a threshold T, set by the user to control the amount of variation to detect, producing a vector of 1's and 0's that indicate values above and below the threshold. This is repeated for all frames except the last, giving a binary feature set F(1:N-1, 1:C). Each dimension c of F(1:N-1, c) is then scanned for patterns m = 1:M, and a histogram is produced for each pattern. A pattern is defined by the number of consecutive 1's, e.g. m = 1 looks for the pattern "010" and m = 2 looks for "0110". The final FDHH feature has dimensions FDHH(1:M, 1:C).
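Since FDHH is the central descriptor of the Sub-task 1 framework, a minimal NumPy sketch of the procedure described above is given below. It is a sketch under stated assumptions rather than the authors' implementation: the function name, the default threshold T, the choice of M and the handling of runs longer than M are all illustrative.

```python
import numpy as np

def fdhh(V, M=5, T=0.05):
    """Feature Dynamic History Histogram (sketch).

    V: (N, C) array of per-frame deep features.
    M: largest run length of consecutive 1's to record.
    T: threshold on the absolute frame-to-frame difference.
    Returns an (M, C) histogram of binary patterns per feature dimension.
    """
    # Binarise the absolute difference between consecutive frames: F(1:N-1, 1:C).
    D = np.abs(np.diff(V, axis=0))          # shape (N-1, C)
    F = (D > T).astype(int)

    # For every dimension, count runs of m consecutive 1's (m = 1..M).
    hist = np.zeros((M, V.shape[1]), dtype=int)
    for c in range(V.shape[1]):
        run = 0
        for bit in np.append(F[:, c], 0):   # trailing 0 closes the last run
            if bit:
                run += 1
            elif run:
                hist[min(run, M) - 1, c] += 1   # runs longer than M fall into bin M (one possible choice)
                run = 0
    return hist

# Example: 250 frames of 1024-dim decoded features give a (5, 1024) FDHH descriptor.
print(fdhh(np.random.rand(250, 1024)).shape)
```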
2.1.3 Stage 3 - Regression Models:
The final stage of the framework is the regression model, of which two are used. The first is a Deep Network trained on the FDHH features with an MSE or Euclidean regression loss; the second treats this trained Deep Network as a pre-trained feature extractor and applies Partial Least Squares (PLS) regression to its features to predict the Arousal and Valence values.

The Deep Network consists of 9 Conv layers, 1 pooling layer, 8 ReLU activation layers and a loss layer at the end. Training is performed with Stochastic Gradient Descent for 100 epochs, with the weights initialized using the Xavier method [5].

The pre-trained features are extracted from the network at the 100th epoch and concatenated with the audio descriptors described in Section 2.2.1, except that here they are computed over whole video clips rather than 1s segments. The concatenated features are rank-normalized between 0 and 1, and PLS regression is then applied.

2.2 Framework Summary for Sub-task 2:

2.2.1 Stage 1 - Audio Descriptors:
The audio descriptors are extracted with the openSMILE software [9] and include 16 low-level descriptors (LLDs): root mean square (RMS) frame energy; zero-crossing rate (ZCR); harmonics-to-noise ratio (HNR); pitch frequency (F0); and mel-frequency cepstral coefficients (MFCC) 1-12. For each LLD and its delta coefficient, 12 functionals are computed: mean, standard deviation, kurtosis, skewness, minimum and maximum value with their relative positions, range, and two linear regression coefficients with their mean square error (MSE). In total, the number of features per 1s segment is 16 x 2 x 12 = 384.
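To make the 16 x 2 x 12 = 384 count concrete, the sketch below computes the 12 functionals for a single 1s segment, assuming the 16 LLD contours have already been extracted (e.g. with openSMILE) and that the factor of two comes from appending first-order delta coefficients. The array shapes, the delta computation and the exact kurtosis/skewness definitions are illustrative assumptions, not the toolkit's internals.

```python
import numpy as np

def functionals(x):
    """12 functionals of one LLD contour over a 1s segment (see the list above)."""
    n, t = len(x), np.arange(len(x))
    slope, offset = np.polyfit(t, x, 1)                  # two linear regression coefficients
    mse = np.mean((x - (slope * t + offset)) ** 2)       # ...and their mean square error
    mu, sd = x.mean(), x.std()
    return np.array([
        mu, sd,
        np.mean((x - mu) ** 4) / sd ** 4,                # kurtosis (population form, assumed)
        np.mean((x - mu) ** 3) / sd ** 3,                # skewness (population form, assumed)
        x.min(), x.max(),
        x.argmin() / n, x.argmax() / n,                  # relative positions of the extremes
        x.max() - x.min(),                               # range
        slope, offset, mse,
    ])

# Hypothetical 16 LLD contours over one 1s segment (e.g. 100 frames) plus their deltas.
llds = np.random.rand(16, 100)
deltas = np.diff(llds, axis=1, prepend=llds[:, :1])
segment = np.concatenate([functionals(c) for c in np.vstack([llds, deltas])])
print(segment.shape)   # (384,) = 16 LLDs x 2 (LLD + delta) x 12 functionals
```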
2.2.2 Stage 2 - Regression Models:
A total of three regression models are trained on the audio descriptors, as follows.

Run 1, Linear Regression + Gaussian smoothing (LR+Gs): After obtaining the predicted labels from the regression stage, a smoothing operation is performed using Gaussian filtering with a window size of 10. The smoothing window is chosen carefully so as to retain the pattern of the labels while improving performance; it is needed to remove high-frequency noise that is irrelevant to the affective dimensions.

Run 2, Partial Least Squares (PLS): PLS is a statistical algorithm that bears some relation to principal components regression. PLS was employed in systems at previous EmotiW challenges, where it gave better results than the baseline [6][7].

Run 3, Least Squares Boosting + Moving Average smoothing (LSB+MAs): LSB is a regression model trained with gradient boosting [4]. The number of regression trees in the ensemble is set to 500 on a training set. After obtaining the predicted labels, a smoothing operation is performed with a moving average filter in order to improve performance.
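All three Sub-task 2 configurations follow the same pattern: fit a regressor on the 384-dimensional audio features of each 1s segment, then smooth the predicted Arousal or Valence curve. A minimal scikit-learn/SciPy sketch is given below; the window size of 10 and the 500 boosting trees follow the text, while the Gaussian sigma, the PLS component count, the moving-average length and the use of a single fit/predict split are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, uniform_filter1d
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import GradientBoostingRegressor

# X: (n_segments, 384) audio functionals; y: per-segment Arousal (or Valence) labels.
# Fitting and predicting on the same split here is purely for illustration.
X, y = np.random.rand(600, 384), np.random.rand(600)

# Run 1: Linear Regression + Gaussian smoothing (window of ~10 segments; sigma assumed).
run1 = gaussian_filter1d(LinearRegression().fit(X, y).predict(X), sigma=10 / 6)

# Run 2: Partial Least Squares regression (component count assumed).
run2 = PLSRegression(n_components=10).fit(X, y).predict(X).ravel()

# Run 3: Least Squares Boosting (default squared-error loss, 500 trees)
#        followed by a moving-average filter (length assumed).
lsb = GradientBoostingRegressor(n_estimators=500).fit(X, y)
run3 = uniform_filter1d(lsb.predict(X), size=10)
```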
3. EXPERIMENTAL SETUP
Five different runs were made for Sub-task 1 and three for Sub-task 2, as follows.

Run 1 & Run 2, Deep + Audio + PLS: These runs use an Auto-Encoder trained with a Euclidean loss function (EUC Loss) and an MSE loss function (MSE Loss) respectively, followed by FDHH feature extraction. The Deep Networks trained with Euclidean and MSE loss respectively are then used as pre-trained feature extractors. These features are fused with the audio features, and a PLS regression model is trained.

Run 3, Audio + PLS: This run uses only the openSMILE audio descriptors, together with a PLS regression model.

Run 4 and Run 5, Deep Network: These runs are also based on Auto-Encoders trained with a Euclidean loss function (EUC Loss) and an MSE loss function (MSE Loss), followed by FDHH feature extraction, with Deep Networks trained on the FDHH features as regression models using Euclidean and MSE regression loss functions respectively.

For Sub-task 2, three different runs were selected, one for each configuration explained in Section 2.2.

4. RESULTS AND DISCUSSIONS
In the official test results, each sub-task was evaluated using Mean Squared Error (MSE) and Pearson Correlation Coefficient (PCC); the results are given in Tables 1 and 2.

Table 1: Results on sub-task 1 using the proposed framework

        Arousal           Valence
Run     MSE      PCC      MSE      PCC
1       1.462    0.251    0.236    0.143
2       1.441    0.271    0.237    0.144
3       1.525    0.143    0.236    0.125
4       1.443    0.248    0.231    0.146
5       1.431    0.263    0.231    0.149

Table 2: Results on sub-task 2 using the proposed framework

        Arousal           Valence
Run     MSE      PCC      MSE      PCC
1       0.157    -0.072   0.173    -0.042
2       0.129    -0.013   0.141    -0.007
3       0.182    -0.039   0.175    -0.062

For Sub-task 1 (Table 1), the results show the strongest performance for Run 5, which uses a trained Deep Network with Euclidean loss and no audio descriptors. It is closely matched by Run 4, the identical configuration using MSE loss to train the Deep Network. The audio descriptors show the weakest performance of all, and may even have increased the errors of Runs 1 and 2, which use audio fusion. In terms of training loss functions (EUC vs MSE), comparing Run 1 vs Run 2 and Run 4 vs Run 5 shows a performance boost for Euclidean loss in most cases, though only a marginal one. For Sub-task 2 (Table 2), PLS gives the lowest MSE, but LR+Gs gives the highest results on PCC. However, all algorithms perform unacceptably badly on Valence, a situation that requires further investigation.

5. CONCLUSIONS
In this working notes paper, we proposed a different framework for each sub-task. The frameworks combine feature extraction using deep learning, FDHH for capturing feature variations across the deep features at frame level, and audio descriptors taken from the speech signal. Several machine learning algorithms were implemented as regression models. The official test results show that the features proposed by the framework are informative, giving good results in terms of MSE and PCC in Sub-task 1 and good results in terms of MSE in Sub-task 2. Future work will focus on the dynamic relationship of the emotion data.

6. REFERENCES
[1] P. Baldi. Autoencoders, unsupervised learning, and deep architectures. In Unsupervised and Transfer Learning - Workshop held at ICML 2011, Bellevue, Washington, USA, July 2, 2011, pages 37–50, 2012.
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, 2015.
[3] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and C. Chamaret. The MediaEval 2016 Emotional Impact of Movies Task. In Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.
[4] J. H. Friedman. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4):367–378, 2002.
[5] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[6] H. Kaya and A. A. Salah. Contrasting and combining least squares based learners for emotion recognition in the wild. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI '15), pages 459–466, 2015.
[7] M. Liu, R. Wang, Z. Huang, S. Shan, and X. Chen. Partial least squares regression on Grassmannian manifold for emotion recognition. In Proceedings of the 15th ACM International Conference on Multimodal Interaction (ICMI '13), pages 525–530, 2013.
[8] H. Meng, N. Pears, M. Freeman, and C. Bailey. Motion history histograms for human action recognition. In Embedded Computer Vision, pages 139–162. 2009.
[9] F. Weninger. openSMILE: open-Source Media Interpretation by Large feature-space Extraction, 2014.
[10] Y. Zhou, D. Arpit, I. Nwogu, and V. Govindaraju. Is Joint Training Better for Deep Auto-Encoders? arXiv e-prints, May 2014.