=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_21
|storemode=property
|title=Visual and Audio Analysis of Movies Video for Emotion Detection @ Emotional Impact of Movies Task MediaEval 2018
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_21.pdf
|volume=Vol-2283
|authors=Elissavet Batziou,Emmanouil Michail,Konstantinos Avgerinakis,Stefanos Vrochidis,Ioannis Patras,Ioannis Kompatsiaris
|dblpUrl=https://dblp.org/rec/conf/mediaeval/BatziouMAVPK18
}}
==Visual and Audio Analysis of Movies Video for Emotion Detection @ Emotional Impact of Movies Task MediaEval 2018==
Elissavet Batziou¹, Emmanouil Michail¹, Konstantinos Avgerinakis¹, Stefanos Vrochidis¹, Ioannis Patras², Ioannis Kompatsiaris¹
¹ Information Technologies Institute, Centre for Research and Technology Hellas
² Queen Mary University of London
batziou.el@iti.gr, michem@iti.gr, koafgeri@iti.gr, stefanos@iti.gr, i.patras@qmul.ac.uk, ikom@iti.gr
Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This work reports the methodology that the CERTH-ITI team developed to recognize the emotional impact that movies have on their viewers in terms of valence/arousal and fear. More specifically, deep convolutional neural networks and several machine learning techniques are used to extract visual features and classify them based on the trained models, while audio features are also taken into account in the fear scenario, leading to highly accurate recognition rates.

1 INTRODUCTION
Emotion-based content has a large number of applications, including emotion-based personalized content delivery [2], video indexing [7], summarization [5] and protection of children from potentially harmful video content. Another intriguing trend that has been attracting a lot of attention lately is style transfer, and more specifically recognizing the emotion of a painting or of a specific section of a movie and transferring its affect to the viewer as a style for a novel creation.

The Emotional Impact of Movies Task is a MediaEval 2018 challenge that comprises two subtasks: (a) valence/arousal prediction and (b) fear prediction from movies. The task provides a large amount of movie videos, their visual and audio features, and their annotations [1]. Both subtasks ask the participants to leverage any available technology in order to determine when and whether fear scenes occur and to estimate a valence-arousal score for each video frame in the provided test data [3].

In this work, CERTH-ITI introduces its algorithms for the valence/arousal and fear recognition subtasks, which rely on deep learning and other classification schemes to recognize the desired outcome. More specifically, a 3-layer neural network (NN) and a simple linear regression model are deployed, with and without PCA, to predict the correct emotion in the valence-arousal subtask, while a pre-trained VGG16 model [6] is combined with a K-Nearest Neighbors (KNN) classification scheme to leverage the visual and audio attributes respectively and identify the correct boundary video frames in the fear subtask.

[Figure 1: Block diagram of our approach for fear recognition]

2 APPROACH

2.1 Valence-Arousal Subtask
In the valence-arousal recognition subtask, keyframe extraction is first applied to extract one video frame per second and correlate the frames with the annotations provided by the MediaEval task organizers, who used the same time interval when recording the human-annotated ground truth. The provided visual features are then concatenated into one vector representation, so that a common and fixed representation scheme is used across different video samples.

The first recognition approach estimates valence/arousal with a linear regression model, which minimizes the residual sum of squares between the ground truth and the predicted responses using a linear approximation (Run 1). PCA is also applied to the final visual feature vectors to reduce their dimensionality and keep only the most discriminant principal components (in our case the first 2000) to represent all features (Run 2).

A Neural Network (NN) framework has also been deployed for the valence/arousal recognition subtask. For that purpose, a 3-hidden-layer NN with ReLU activations and the Adam optimizer with learning rate 0.001 was used. The sizes of the hidden layers are 64, 32 and 32 respectively, the batch size is 10 and training runs for 10 epochs. Two thirds of the development set are used for training and the remaining third for validation. The input of the NN is the set of concatenated visual feature vectors (Run 3). PCA has also been used to reduce the high-dimensional concatenated vectors (5367 dimensions) to 2000 principal components (Run 4).
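The NN configuration above (hidden layers of 64/32/32 units, ReLU, Adam with learning rate 0.001, batch size 10, 10 epochs, 2/3-1/3 development split) can be expressed compactly in Keras. The following is a minimal illustrative sketch, not the authors' code: the input dimensionality of 5367 comes from the concatenated visual features mentioned above, while the data arrays are random placeholders.

<pre>
# Sketch of a 3-hidden-layer regression NN as described above (assumptions noted in comments).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_va_regressor(input_dim=5367):
    model = Sequential([
        Dense(64, activation="relu", input_shape=(input_dim,)),
        Dense(32, activation="relu"),
        Dense(32, activation="relu"),
        Dense(1)  # single continuous output: valence (or arousal) score
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")
    return model

# X: per-second concatenated visual feature vectors, y: valence or arousal annotations
# (placeholder random data; 2/3 of the development set for training, 1/3 for validation)
X = np.random.rand(300, 5367).astype("float32")
y = np.random.rand(300).astype("float32")
split = int(len(X) * 2 / 3)
model = build_va_regressor()
model.fit(X[:split], y[:split], validation_data=(X[split:], y[split:]),
          batch_size=10, epochs=10)
</pre>

Replacing the mean-squared-error loss or the optimizer settings would give the PCA-based variant (Run 4) the same network, only with 2000-dimensional inputs after projection.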
2.2 Fear Subtask
For the fear recognition subtask, we initially perform keyframe extraction every one second, as in the valence subtask. The frames annotated as "fear" were significantly fewer than those of the "no-fear" class and, therefore, in order to balance our dataset we used data augmentation techniques. Firstly, we downloaded about 10,000 images tagged "fear" from Flickr, and we also downloaded the images of the image emotion dataset (http://www.imageemotion.org/) and kept those annotated as "fear". In order to further increase the number of fear frames, we additionally applied data augmentation to the provided annotated frames: we randomly rotate and translate pictures vertically or horizontally, randomly apply shearing transformations, randomly zoom inside pictures, flip half of the images horizontally, and fill in the newly created pixels that can appear after a rotation or a width/height shift. Finally, we reduce the set of no-fear frames. After these steps, we had about 23,000 "fear" and 30,000 "no-fear" images to train our model.

We used transfer learning in order to exploit information from a large-scale dataset and to train our model in a realistic and efficient time. The architecture that we chose to represent our features is VGG16 pre-trained on the Places2 dataset [8], because the majority of the movies have places as background, so we assume that it would be helpful. We use the Nadam optimizer with learning rate 0.0001, batch size 32 and 50 epochs. Finally, we set a threshold of 0.4 on the predicted fear probability (Run 1). In a different approach, we used the same architecture but removed isolated predicted frames (Run 2).
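A compact sketch of this visual pipeline, under stated assumptions, is given below: Keras image augmentation with the transforms listed above, a VGG16 backbone with a small binary "fear"/"no-fear" head, Nadam with learning rate 0.0001, and the 0.4 decision threshold of Run 1. This is not the authors' code: keras.applications only ships ImageNet weights, so the Places2-pretrained weights they used would need to be loaded separately, and the augmentation parameters, directory layout and head sizes are assumptions.

<pre>
# Illustrative fear-frame classifier (assumed parameters marked in comments).
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation: rotations, width/height shifts, shear, zoom, horizontal flips,
# and filling of newly created pixels (ranges are assumptions).
augmenter = ImageDataGenerator(rescale=1.0 / 255, rotation_range=20,
                               width_shift_range=0.1, height_shift_range=0.1,
                               shear_range=0.2, zoom_range=0.2,
                               horizontal_flip=True, fill_mode="nearest")
train_gen = augmenter.flow_from_directory("fear_dataset/train",   # hypothetical path
                                          target_size=(224, 224),
                                          batch_size=32, class_mode="binary")

# VGG16 backbone (ImageNet weights here; the paper uses Places2 weights) with a
# binary classification head for "fear" vs "no-fear".
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # transfer learning: keep the pre-trained convolutional features
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
out = Dense(1, activation="sigmoid")(x)
model = Model(base.input, out)
model.compile(optimizer=Nadam(learning_rate=0.0001),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_gen, epochs=50)

# Per-frame decision with the 0.4 threshold of Run 1.
probs = model.predict(train_gen)            # in practice: the per-second test frames
fear_frames = (probs.ravel() >= 0.4)
</pre>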
Additionally, in order to exploit auditory information, we developed a classification method applied to the audio features already extracted by the challenge committee with the openSMILE toolbox [4]. The audio feature vectors, consisting of 1582 features extracted from the videos every second, were separated into a training (80%) and a validation (20%) set. In order to equalize the size of the two classes in the training set, we randomly removed "no-fear" samples. We applied KNN classification with N=3 on the test set, and the results were further processed in order to remove erroneous false negatives (single "no-fear" samples around "fear" areas) and false positives (isolated small "fear" areas consisting of one or two "fear" samples).

The results from the visual and the audio analysis were submitted both separately, as different runs, and in combination, by taking the posterior probabilities of the visual and auditory classifications and setting a threshold of 0.7 on their average probability. The overall block diagram of this approach is depicted in Figure 1.

3 RESULTS AND ANALYSIS
We submitted 4 runs for valence/arousal prediction and their results are presented in Table 1. Two evaluation measures are used in the experiments: (a) Mean Square Error (MSE) and (b) Pearson Correlation Coefficient (r). We observe that the NN approach described in the previous section has the best performance among all runs. Furthermore, it is worth mentioning that the linear regression model produces some extremely high scores, probably because the original feature vectors were neither discriminative nor adequate enough to build the regression model. However, PCA projection to a lower-dimensional space with higher discriminative power appears to solve this problem, as it reduces the redundant noise and keeps the most important features. Moreover, there is a "NaN" score for the Pearson measure in the arousal prediction, because we accidentally kept the training target constant, so our model predicts the same score for all frames; this score does not characterize our model, since it does not appear in any other prediction within the valence/arousal subtask.

Table 1: CERTH-ITI predictions (Valence and Arousal: MSE and Pearson r; Fear: IoU)

Run | Valence MSE   | Valence r | Arousal MSE   | Arousal r | Fear IoU
1   | 396901706.564 | 0.079     | 1678218552.19 | 0.054     | 0.075
2   | 0.139         | 0.010     | 0.181         | -0.022    | 0.065
3   | 0.117         | 0.098     | 0.138         | nan       | 0.053
4   | 0.142         | 0.067     | 0.187         | -0.029    | 0.063

We also submitted 4 runs for the fear prediction subtask; their results are likewise presented in Table 1 and are evaluated in terms of Intersection over Union (IoU). From Table 1 we see that the best performance for the fear recognition subtask is obtained by Run 1, which uses all predicted scores of the pre-trained VGG16 model. In addition, our intuition to remove isolated predicted frames (Run 2), as they are not associated with any duration, did not perform better than Run 1, since it discards significant information (video frames that invoke fear).

4 DISCUSSION AND OUTLOOK
In this paper we report the CERTH-ITI team approach to the MediaEval 2018 Challenge "Emotional Impact of Movies" task. The results in the valence/arousal prediction subtask show that, according to MSE, the best result is obtained by Run 3 for both valence and arousal, while with respect to the Pearson Correlation Coefficient, Run 1 has the best performance for arousal and the second best performance for valence. The Pearson correlation measures the linear correlation between two variables, whereas the MSE is a sum of squared deviations between predicted and ground-truth values, regardless of whether they are linearly correlated or not.

The results of the fear prediction subtask show that the inclusion of audio features failed to enhance the classification performance as we had expected. This could be due to several reasons, the most prominent one being the inability to perform data augmentation on audio features as in the case of the visual analysis. Both this reason and the large imbalance between the two classes, which led us to discard many "no-fear" annotations in order to balance the training set, resulted in a very limited training set. These drawbacks could be overcome by using classification methods able to handle unbalanced training sets, such as penalized models, by enriching the training set with external annotated datasets, and by exploring more efficient fusion methods, such as performing classification on fused audiovisual features instead of combining separate classification results a posteriori.
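The audio branch and the late fusion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 3-nearest-neighbour classifier on the 1582-dimensional openSMILE vectors and the 0.7 fusion threshold come from the text, while the smoothing heuristic for isolated predictions, the data arrays and all names are assumptions.

<pre>
# Illustrative audio KNN branch with temporal cleanup and score-level fusion.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def smooth_isolated(labels, min_run=3):
    """Flip label runs shorter than min_run (single 'no-fear' gaps or one/two-sample
    'fear' islands) to the surrounding label; a simple stand-in for the paper's
    post-processing of erroneous false negatives/positives."""
    labels = labels.copy()
    i = 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        if j - i < min_run and i > 0 and j < len(labels):
            labels[i:j] = labels[i - 1]
        i = j
    return labels

# X_audio: per-second openSMILE vectors (1582 dims); y: 1 = "fear", 0 = "no-fear"
# (placeholder random data standing in for the balanced training set)
X_audio = np.random.rand(200, 1582)
y = (np.random.rand(200) > 0.8).astype(int)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_audio, y)

X_test = np.random.rand(50, 1582)
audio_prob = knn.predict_proba(X_test)[:, 1]
audio_labels = smooth_isolated((audio_prob >= 0.5).astype(int))

# Late fusion: average the visual and audio fear probabilities and threshold at 0.7.
visual_prob = np.random.rand(50)   # placeholder for the per-frame VGG16 scores
fused_fear = ((visual_prob + audio_prob) / 2.0 >= 0.7).astype(int)
</pre>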
ACKNOWLEDGMENTS
This work was funded by the EC-funded project V4Design under contract number H2020-779962.

REFERENCES
[1] Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. 2015. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55.
[2] Luca Canini, Sergio Benini, and Riccardo Leonardi. 2013. Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Technology 23, 4 (2013), 636–647.
[3] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg. 2018. The MediaEval 2018 Emotional Impact of Movies Task. In Working Notes Proceedings of the MediaEval 2018 Workshop.
[4] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In MM 2013 - Proceedings of the 2013 ACM Multimedia Conference. 835–838.
[5] Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Chua Tat-Seng. 2011. Affective video summarization and story board generation using pupillary dilation and eye gaze. In Multimedia (ISM), 2011 IEEE International Symposium on. IEEE, 319–326.
[6] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[7] Shiliang Zhang, Qingming Huang, Shuqiang Jiang, Wen Gao, and Qi Tian. 2010. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia 12, 6 (2010), 510–522.
[8] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2018. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2018), 1452–1464.