GLA in MediaEval 2018 Emotional Impact of Movies Task

Jennifer J. Sun¹, Ting Liu, Gautam Prasad
Google LLC
jjsun@caltech.edu, {liuti,gautamprasad}@google.com

¹ This work was completed during Jennifer's internship at Google.

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
We present our methods for the MediaEval 2018 Emotional Impact of Movies Task, predicting the expected valence and arousal continuously in movies. Our approach leverages image, audio, and face based features computed using pre-trained neural networks. These features were computed over time and modeled using a gated recurrent unit (GRU) based network followed by a mixture of experts model to compute multiclass predictions. We smoothed these predictions using a Butterworth filter for our final result.

1 INTRODUCTION
The Emotional Impact of Movies Task [7], part of the MediaEval 2018 benchmark, provides participants with a common dataset for predicting the expected emotional impact of videos. We focused on the first subtask in the challenge: predicting the expected valence and arousal continuously (every second) in movies. The dataset provided by the task is the LIRIS-ACCEDE dataset [3, 4], which is annotated with self-reported valence and arousal every second from multiple annotators. Since deep neural networks such as Inception [19] have millions of trainable parameters, the competition data may be too limited to train these networks from random initializations. Therefore, we used networks pre-trained on larger datasets, such as ImageNet, to extract features from the LIRIS-ACCEDE dataset. The extracted features were used to train our temporal and regression models.

This task is recurring, with multiple submissions every year [11, 14]. Our method's novelty lies in the unique set of features we extracted, including image, audio, and face features (capitalizing on transfer learning), along with our model setup, which combines a GRU with a mixture of experts.

2 APPROACH
We approached valence and arousal prediction as a multivariate regression problem. Our objective is to minimize a multi-label sigmoid cross-entropy loss, which could allow the model to exploit potential relationships between the two dimensions. We first used pre-trained networks to extract features. To model the temporal aspects of the data, we evaluated long short-term memory (LSTM) networks, gated recurrent units (GRUs), and temporal convolutional networks (TCNs). Multiple modalities were fused with late fusion, and valence and arousal were predicted jointly. Our method is implemented in TensorFlow, and we used the Adam optimizer [12] in all our experiments.

2.1 Feature Extraction
We extracted image, audio, and face features from each frame of the movies. Our image features (Inception-Image) were from the Inception network [19] pre-trained on ImageNet [16]. We extracted audio features using AudioSet [9], which is a VGG-inspired model pre-trained on YouTube-8M [1]. For the face features (Inception-Face), we focused on the two largest faces in each frame and used an Inception based architecture trained on faces [17]. Since the movies were human-focused, faces were found in most of the scenes. We compared these features with those used in last year's competition, which included image features computed using VGG16 [18] and audio features computed using openSMILE [8]. All our features were extracted at one frame per second.
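For concreteness, a minimal sketch of frame-level image feature extraction along the lines described above is shown below. The specific backbone (Keras InceptionV3 with global average pooling), input resolution, and batch size are illustrative assumptions rather than the exact checkpoints and preprocessing used for our Inception-Image features; the audio and face features would be extracted analogously with their respective pre-trained networks.

```python
# Sketch of frame-level image feature extraction (assumptions: Keras
# InceptionV3 as a stand-in for the Inception-Image features; frames are
# pre-decoded at 1 fps into a numpy array).
import numpy as np
import tensorflow as tf

# ImageNet-pretrained Inception backbone; global average pooling -> 2048-d.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")

def image_features(frames_1fps: np.ndarray) -> np.ndarray:
    """frames_1fps: [num_seconds, 299, 299, 3] uint8 frames, one per second."""
    x = tf.keras.applications.inception_v3.preprocess_input(
        frames_1fps.astype(np.float32))
    return backbone.predict(x, batch_size=32)  # [num_seconds, 2048]
```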
2.2 Temporal Models
To model the temporal dynamics of emotion in the videos, we used recurrent neural networks. In particular, we used LSTMs [10] and GRUs [6] as part of our modeling pipeline in a sequence-to-one setup with sequence lengths of 10, 30, or 60 seconds. The self-reported emotions likely depend on past scenes in the movies, so temporal modeling is important for this task. In addition, we evaluated TCNs because of their promising performance in sequence modeling [2] and action segmentation [13]. Specifically, we trained an encoder-decoder TCN on sequences of extracted features to obtain a sequence of valence and arousal predictions.

2.3 Regression Models
The input from each modality (image, audio, or face) is fed into a separate recurrent model. The output we use from each recurrent model is its hidden state, which contains information on the previous data seen by the model; we take the hidden state corresponding to the final timestamp of the input sequence. The state vectors for the modalities are concatenated into a single vector and fed into a context gate [15]. Multimodal fusion occurs at this stage, as the learned gate fuses the representations of each modality produced by the RNNs. The output of the context gate is then fed into a mixture-of-experts model with another context gate to obtain the final emotion predictions. We use logistic regression experts with a softmax gating network.

To prevent overfitting, we regularized our models using L2 regularization, dropout, and batch normalization. Finally, a low-pass filter is applied to the predictions to smooth the outputs. In LIRIS-ACCEDE, the measured emotion data vary smoothly in time, but our raw regression outputs contain high-frequency components. To smooth them, we tested weighted moving average filters and low-pass Butterworth filters [5] as implemented in SciPy.
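The sketch below illustrates the regression pipeline of Sections 2.2 and 2.3: per-modality sequence-to-one GRUs, concatenation of the final hidden states, a context gate, and a mixture of logistic-regression experts with a softmax gating network followed by a second context gate. The feature dimensions, layer sizes, number of experts, and exact gating formulation are assumptions for illustration and do not reproduce the submitted configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN = 60          # input sequence length in seconds
NUM_EXPERTS = 4       # illustrative; not the submitted setting
NUM_OUTPUTS = 2       # valence and arousal

def context_gate(x):
    # Context Gating [15]: elementwise y = sigmoid(W x + b) * x.
    gate = layers.Dense(x.shape[-1], activation="sigmoid")(x)
    return layers.Multiply()([x, gate])

def build_model(feature_dims={"image": 2048, "audio": 128, "face": 1024},
                gru_units=128):
    inputs, states = [], []
    for name, dim in feature_dims.items():
        inp = layers.Input(shape=(SEQ_LEN, dim), name=name)
        # Sequence-to-one: keep only the final GRU hidden state per modality.
        states.append(layers.GRU(gru_units, name="gru_" + name)(inp))
        inputs.append(inp)

    # Late fusion of the per-modality states through a context gate.
    fused = context_gate(layers.Concatenate()(states))

    # Mixture of experts: logistic-regression experts with a softmax gate,
    # one mixture per output dimension.
    experts = layers.Dense(NUM_EXPERTS * NUM_OUTPUTS, activation="sigmoid")(fused)
    experts = layers.Reshape((NUM_OUTPUTS, NUM_EXPERTS))(experts)
    gate_logits = layers.Dense(NUM_EXPERTS * NUM_OUTPUTS)(fused)
    gates = layers.Softmax(axis=-1)(
        layers.Reshape((NUM_OUTPUTS, NUM_EXPERTS))(gate_logits))
    preds = layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=-1))([experts, gates])

    # Second context gate on the expert output, as described in Section 2.3.
    preds = context_gate(preds)

    model = tf.keras.Model(inputs, preds)
    # Sigmoid cross-entropy objective, assuming labels rescaled to [0, 1].
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The binary cross-entropy loss here stands in for the multi-label sigmoid cross-entropy objective of Section 2, under the assumption that the valence and arousal labels are rescaled to [0, 1].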
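For the smoothing step, the following is a minimal example of low-pass Butterworth filtering with SciPy as described above; the filter order and cutoff frequency are illustrative, not the values used for our submissions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_predictions(preds: np.ndarray, cutoff_hz=0.02, fs=1.0, order=2):
    """Zero-phase low-pass Butterworth smoothing of per-second predictions.

    preds: [num_seconds, 2] raw valence/arousal outputs sampled at 1 Hz.
    cutoff_hz and order are illustrative, not the submitted settings.
    """
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, preds, axis=0)
```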
3 RESULTS AND ANALYSIS
We optimized the hyperparameters of our models for the best performance on the validation set, which consists of 13 movies from the development set. We then trained our models on the entire development set before running inference on the test set. Our setup used a batch size of 512.

In evaluating our recurrent models, we found that the Inception-Image+AudioSet features outperformed the VGG16+openSMILE features in terms of both mean squared error (MSE) and Pearson correlation coefficient (PCC). In some cases the recurrent model using VGG16+openSMILE would predict near the mean for both valence and arousal, possibly because those features did not carry enough information for the model to discriminate between different values of valence and arousal. We also found a significant increase in performance when we added the Inception-Face features, which may point to salient information captured in connection with the expected emotions in the videos.

The sequence-to-one recurrent models worked best with longer input sequences of 60 seconds rather than 10 or 30 seconds, perhaps because the evoked emotion is affected by longer-lasting scenes. Our recurrent models also performed better on the validation set than the TCN, and the GRU models had performance similar to the LSTMs. We used GRUs for our implementation because they are computationally simpler than LSTMs; since we have a small dataset, we wanted to reduce model complexity to prevent overfitting. Our temporal model architectures ranged from 32 to 256 units and 1 to 2 layers, optimized for each of the modalities. For post-processing, the low-pass Butterworth filter worked better than the moving average filter, likely because the Butterworth filter is designed to have a frequency response that is as flat as possible (with no ripples) in the pass-band. Fluctuations of the magnitude response within the pass-band may decrease the accuracy of our regression output.

In Table 1 we list the performance of the five models we submitted to the task. All runs used Inception-Image, AudioSet, and Inception-Face features with a GRU and mixture-of-experts regression, and are defined as follows.
(1) No dropout or batch normalization.
(2) Regularized with dropout and batch normalization. Trained on approximately 70% of the data.
(3) Regularized with dropout and batch normalization.
(4) Regularized with dropout and batch normalization, with a different initialization and number of epochs.
(5) Average over all runs.

Table 1: Performance of our five models.

           Valence            Arousal
           MSE      PCC       MSE      PCC
Run 1      0.1193   0.1175    0.1384   0.2911
Run 2      0.0945   0.1376    0.1479   0.1957
Run 3      0.1133   0.1883    0.1778   0.2773
Run 4      0.1073   0.2779    0.1396   0.3513
Run 5      0.0837   0.1786    0.1334   0.3358

We see that creating an ensemble by averaging over the runs gives the lowest MSE (Run 5). This is likely because averaging decreases the variance of the predictions, so that overall the mean is closer to the ground-truth labels. While averaging improves MSE, it does not improve correlation; our model with the best correlation is from Run 4.

We note that using batch normalization during inference increases the variance of our predictions, because the batches are then normalized with batch statistics instead of the population statistics from the training set. Our validation results (with repeated runs) as well as our test results show that using batch normalization in this way improves predictions for valence, but not as much for arousal. This is most likely because the test-set statistics for valence differ from those of the training set, while the test-set statistics for arousal may be closer to the training-set statistics. One explanation could be the small size of the dataset, so that the statistics of the training set do not generalize well to the test set.
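In Keras terms, the batch-statistics behaviour analysed above corresponds to calling the model in training mode at inference time. The snippet below is a sketch under the assumption of a Keras model such as the one outlined after Section 2.3, extended with BatchNormalization layers, with features holding a dict of input arrays.

```python
# Standard inference: BatchNormalization uses the moving (population)
# statistics accumulated during training.
preds_population = model(features, training=False)

# Inference with batch statistics, as analysed above: training=True makes
# BatchNormalization normalize each batch with its own mean and variance.
# (Caveat: this flag also re-enables dropout, so in practice it would be
# passed to the normalization layers individually.)
preds_batch_stats = model(features, training=True)
```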
This arXiv:1803.01271 (2018). Emotional Impact of Movies Task MediaEval’18, 29-31 October 2018, Sophia Antipolis, France [3] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, [19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon and Liming Chen. 2015. Deep learning vs. kernel methods: Shlens, and Zbigniew Wojna. 2016. Rethinking the incep- Performance for emotion prediction in videos. In Affective tion architecture for computer vision. In Proceedings of the Computing and Intelligent Interaction (ACII), 2015 Interna- IEEE conference on computer vision and pattern recognition. tional Conference on. IEEE, 77–83. 2818–2826. [4] Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. 2015. Liris-accede: A video database for affective content analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55. [5] Stephen Butterworth. 1930. On the theory of filter amplifiers. Wireless Engineer 7, 6 (1930), 536–541. [6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations us- ing RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014). [7] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Viktor Sjöberg. 2018. The mediaeval 2018 emotional impact of movies task. In MediaEval 2018 Multimedia Benchmark Workshop Working Notes Proceed- ings of the MediaEval 2018 Workshop. [8] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 835–838. [9] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human- labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Con- ference on. IEEE, 776–780. [10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short- term memory. Neural computation 9, 8 (1997), 1735–1780. [11] Zitong Jin, Yuqi Yao, Ye Ma, and Mingxing Xu. 2017. THUHCSI in MediaEval 2017 Emotional Impact of Movies Task. Proc. MediaEval (2017). [12] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). [13] Colin Lea, Rene Vidal, Austin Reiter, and Gregory D Hager. 2016. Temporal convolutional networks: A unified approach to action segmentation. In European Conference on Computer Vision. Springer, 47–54. [14] Yang Liu, Zhonglei Gu, and Tobey H Ko. 2017. HKBU at MediaEval 2017 Emotional Impact of Movies Task. In Mediaeval 2017 Workshop. Dublin, Ireland. [15] Antoine Miech, Ivan Laptev, and Josef Sivic. 2017. Learnable pooling with Context Gating for video classification. arXiv preprint arXiv:1706.06905 (2017). [16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San- jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252. [17] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823. [18] Karen Simonyan and Andrew Zisserman. 2014. 
Very deep con- volutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).