=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_5
|storemode=property
|title=The MediaEval 2017 Emotional Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_5.pdf
|volume=Vol-1984
|authors=Emmanuel Dellandréa,Martijn Huigsloot,Liming Chen,Yoann Baveye,Mats Sjöberg
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DellandreaH0BS17
}}
==The MediaEval 2017 Emotional Impact of Movies Task==
Emmanuel Dellandréa (Ecole Centrale de Lyon, France, emmanuel.dellandrea@ec-lyon.fr), Martijn Huigsloot (NICAM, Netherlands, Huigsloot@nicam.nl), Liming Chen (Ecole Centrale de Lyon, France, liming.chen@ec-lyon.fr), Yoann Baveye (Université de Nantes, France, yoann.baveye@univ-nantes.fr) and Mats Sjöberg (HIIT, University of Helsinki, Finland, mats.sjoberg@helsinki.fi)

ABSTRACT

This paper provides a description of the MediaEval 2017 "Emotional Impact of Movies" task. It continues to build on previous years' editions. In this year's task, participants are expected to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence, arousal and fear. Here we provide a description of the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.

1 INTRODUCTION

Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [3], video indexing [13], and efficient movie visualization and browsing [14]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [9] or personalized soundtrack recommendation to make user-generated videos more attractive [11]. Affective techniques can also be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [12].

While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities such as computer vision and machine learning, with the overall goal of endowing computers with human-like perception capabilities. This task is therefore proposed to offer researchers a place to compare their approaches for the prediction of the emotional impact of movies. It continues to build on previous years' editions [5], with a first subtask that merges last year's valence and arousal prediction tasks, and a new subtask dedicated to fear prediction.

2 TASK DESCRIPTION

The task requires participants to deploy multimedia features and models to automatically predict the emotional impact of movies. This emotional impact is considered here to be the expected emotion, i.e., the emotion that the majority of the audience feels in response to the same content. In other words, the expected emotion is the expected value of the experienced (i.e., induced) emotion in a population. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [8].

This year, two new scenarios are proposed as subtasks. In both cases, long movies are considered and the emotional impact has to be predicted for consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds (see the sketch after this list):

(1) Valence/Arousal prediction: participants' systems are expected to predict a score of expected valence and arousal for each consecutive 10-second segment. Valence is defined as a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions [10].

(2) Fear prediction: the purpose here is to predict, for each consecutive 10-second segment, whether it is likely to induce fear or not. The targeted use case is the prediction of frightening scenes to help systems protect children from potentially harmful video content. This subtask is complementary to the valence/arousal prediction subtask in the sense that discrete emotions often overlap when mapped into the 2D valence/arousal space (for instance, fear, disgust and anger overlap since they are all characterized by very negative valence and high arousal) [7].
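The 10-second window with a 5-second shift is the prediction unit for both subtasks. As a minimal illustration (not part of the official task material), the following Python sketch enumerates the segment boundaries for a movie of a given duration; dropping the final partial window is an assumption made here for clarity.

<pre>
# Illustrative sketch: enumerate the consecutive 10-second segments
# (5-second shift) on which predictions are made. Dropping the final
# partial window is an assumption for illustration only.
def sliding_segments(duration_sec, window=10.0, shift=5.0):
    """Yield (start, end) times, in seconds, of the prediction segments."""
    start = 0.0
    while start + window <= duration_sec:
        yield (start, start + window)
        start += shift

# Example: a 32-second clip yields (0,10), (5,15), (10,20), (15,25), (20,30)
print(list(sliding_segments(32)))
</pre>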
3 DATA DESCRIPTION

The dataset used in this task is the LIRIS-ACCEDE dataset (http://liris-accede.ec-lyon.fr). It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution [2]. Several movie genres are represented in this collection, such as horror, comedy, drama and action. Languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English.

The continuous part of LIRIS-ACCEDE [1] is used as the development set for both subtasks. It consists of a selection of 30 movies. The selected videos are between 117 and 4,566 seconds long (mean = 884.2 s, SD = 766.7 s). The total length of the 30 selected movies is 7 hours, 22 minutes and 5 seconds.

The test set consists of a selection of 14 movies that are not part of the 160 original movies. They are between 210 and 6,260 seconds long (mean = 2,045.2 s, SD = 2,450.1 s). The total length of the 14 selected movies is 7 hours, 57 minutes and 13 seconds.

In addition to the video data, participants are also provided with general-purpose audio and visual content features. To compute the audio features, movies have first been processed to extract consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds. Audio features have then been extracted from these segments using the openSMILE toolbox (http://audeering.com/technology/opensmile/) [6]. The default configuration named "emobase2010.conf" was used. It computes 1,582 features, which result from a base of 34 low-level descriptors (LLD) with 34 corresponding delta coefficients appended, and 21 functionals applied to each of these 68 LLD contours (1,428 features). In addition, 19 functionals are applied to the 4 pitch-based LLDs and their 4 delta coefficient contours (152 features). Finally, the number of pitch onsets (pseudo syllables) and the total duration of the input are appended (2 features).
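To make the audio pipeline concrete, here is a hedged sketch of how such per-segment features could be extracted with the openSMILE command-line tool. The binary name, configuration path, file naming and ARFF output are assumptions; the exact invocation used by the organizers is not specified and depends on the openSMILE version.

<pre>
# Hedged sketch: run openSMILE's SMILExtract on one pre-cut 10-second
# audio segment. Paths, naming and the ARFF output sink are assumptions.
import subprocess
from pathlib import Path

SMILEXTRACT = "SMILExtract"             # openSMILE binary, assumed to be on PATH
CONFIG = "config/emobase2010.conf"      # configuration named in the text above

def extract_segment_features(wav_path: Path, out_path: Path) -> None:
    """Extract the 1,582-dimensional emobase2010 feature vector for one segment."""
    subprocess.run(
        [SMILEXTRACT, "-C", CONFIG, "-I", str(wav_path), "-O", str(out_path)],
        check=True,
    )

# e.g. extract_segment_features(Path("movie_0000_0010.wav"), Path("movie_0000_0010.arff"))
</pre>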
Beyond audio features, image frames were extracted from each movie every second. For each of these images, several general-purpose visual features are provided. They have been computed using the LIRE library (http://www.lire-project.net/), except for the CNN features (VGG16 fc6 layer), which have been extracted using the MATLAB Neural Network Toolbox (https://www.mathworks.com/products/neural-network.html). The visual features are the following: Auto Color Correlogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, Joint descriptor combining CEDD and FCTH in one histogram, Scalable Color, Tamura, Local Binary Patterns, and VGG16 fc6 layer.

4 GROUND TRUTH

Annotations are provided to participants for the 30 movies of the development set. For each movie, a first file contains valence and arousal values for consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds, and a second file indicates whether these segments are supposed to induce fear (value 1) or not (value 0).

4.1 Ground Truth for the first subtask

In order to collect continuous valence and arousal annotations, 16 French participants had to continuously indicate their level of induced valence or arousal while watching the movies, using a modified version of the GTrace annotation tool [4] and a joystick (10 participants for the development set and 6 for the test set). Movies were divided into two subsets. Each annotator continuously annotated one subset considering the induced valence and the other subset considering the induced arousal. Thus, each movie has been continuously annotated by five annotators for the development set and three for the test set.

The continuous valence and arousal annotations from the participants were then down-sampled by averaging them over windows of 10 seconds with a shift of 1 second (i.e., one value per second), in order to remove the noise due to unintended movements of the joystick. Finally, these post-processed continuous annotations were averaged across annotators to create a continuous mean signal of the valence and arousal self-assessments. The details of this processing are given in [1]. For the purpose of the first subtask, these values were further averaged to obtain a single valence value and a single arousal value for every consecutive 10-second segment sliding over the whole movie with a shift of 5 seconds.
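As an illustration of this aggregation pipeline (the authoritative description is in [1]), the following sketch shows one plausible implementation assuming one raw annotation value per second per annotator; window alignment and edge handling are illustrative assumptions, not the organizers' exact procedure.

<pre>
# Hedged sketch of the valence/arousal ground-truth aggregation described
# above, assuming one raw annotation value per second per annotator.
# Window alignment and edge handling are assumptions; see [1] for details.
import numpy as np

def smooth_trace(raw, window=10):
    """Down-sample one annotator's trace: mean over 10-s windows, 1-s shift."""
    return np.array([raw[t:t + window].mean()
                     for t in range(len(raw) - window + 1)])

def per_segment(mean_signal, window=10, shift=5):
    """Average the cross-annotator mean signal over each 10-s segment (5-s shift)."""
    return np.array([mean_signal[t:t + window].mean()
                     for t in range(0, len(mean_signal) - window + 1, shift)])

# One array per annotator, one value per second (dummy 10-minute movie).
traces = [np.random.rand(600) for _ in range(5)]
mean_signal = np.mean([smooth_trace(t) for t in traces], axis=0)
valence_per_segment = per_segment(mean_signal)   # one value per 10-s segment
</pre>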
4.2 Ground Truth for the second subtask

Fear annotations for the second subtask were generated using a tool specifically designed for the classification of audio-visual media, which allows annotation to be performed while watching the movie. The annotations were carried out by two experienced team members of NICAM (http://www.kijkwijzer.nl/nicam), both trained in the classification of media. Each movie was annotated by one annotator, who reported the start and stop times of each sequence in the movie expected to induce fear. From this information, the 10-second segments sliding over the whole movie with a shift of 5 seconds have been labeled as fear (value 1) if they intersect one of the fear sequences, and as not fear (value 0) otherwise.

5 RUN DESCRIPTION

Participants can submit up to 5 runs for each of the two subtasks, i.e., 10 runs in total. Models can rely on the features provided by the organizers or on any other external data.

6 EVALUATION CRITERIA

Standard evaluation metrics are used to assess system performance. The first subtask can be considered as a regression problem (estimation of expected valence and arousal scores), while the second subtask can be seen as a binary classification problem (the video segment is supposed to induce / not induce fear).

For the first subtask, the official metric is the Mean Square Error (MSE), which is the measure most commonly used to evaluate regression models. However, to allow a deeper understanding of system performance, we also consider Pearson's Correlation Coefficient. MSE alone is not always sufficient to analyze a model's quality, and the correlation may be required for a deeper performance analysis. For example, if a large portion of the data is neutral (i.e., its valence score is close to 0.5) or is distributed around the neutral score, a uniform model that always outputs 0.5 will obtain a good (low) MSE. In this case, the lack of accuracy of the model will be exposed by the correlation between the predicted values and the ground truth, which will also be very low.

For the second subtask, the official metric is the Mean Average Precision (MAP). Moreover, Accuracy, Precision, Recall and F1-score are also considered to provide insights into system behaviour.
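To make these measures concrete, here is a hedged sketch using SciPy and scikit-learn as stand-ins for the organizers' own scoring scripts (which are not specified here); the data is synthetic, and average_precision_score computes average precision for a single list of segments, whereas the task's MAP is presumably averaged over movies. The sketch also illustrates the point above: a constant "always neutral" predictor can obtain a low MSE while its correlation with the ground truth is undefined or very low.

<pre>
# Hedged sketch of the evaluation measures, using SciPy/scikit-learn as
# stand-ins for the official scoring scripts. All data below is synthetic.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (mean_squared_error, average_precision_score,
                             accuracy_score, precision_score,
                             recall_score, f1_score)

rng = np.random.default_rng(0)

# Subtask 1: valence/arousal regression on mostly neutral ground truth.
y_true = rng.normal(0.5, 0.05, size=1000)
const_pred = np.full_like(y_true, 0.5)                  # uniform "always 0.5" model
model_pred = y_true + rng.normal(0, 0.02, size=1000)    # an imperfect real model
print("constant model MSE:", mean_squared_error(y_true, const_pred))  # deceptively low
print("real model MSE:", mean_squared_error(y_true, model_pred))
print("real model Pearson r:", pearsonr(y_true, model_pred)[0])
# Pearson r is undefined for the constant prediction (zero variance), which is
# exactly why correlation exposes such degenerate models.

# Subtask 2: binary fear classification per segment.
fear_true = rng.integers(0, 2, size=1000)
fear_score = rng.random(1000)                           # model confidence per segment
fear_pred = (fear_score > 0.5).astype(int)
print("AP:", average_precision_score(fear_true, fear_score))
print("Acc:", accuracy_score(fear_true, fear_pred),
      "P:", precision_score(fear_true, fear_pred),
      "R:", recall_score(fear_true, fear_pred),
      "F1:", f1_score(fear_true, fear_pred))
</pre>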
7 CONCLUSIONS

The Emotional Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for emotion detection in movies, in terms of valence, arousal and fear. The LIRIS-ACCEDE dataset has been used as development and test sets. Details on the methods and results of each individual team can be found in the papers of the participating teams in the MediaEval 2017 workshop proceedings.

ACKNOWLEDGMENTS

This task is supported by the CHIST-ERA Visen project ANR-12-CHRI-0002-04.

REFERENCES

[1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos. In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII).
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. LIRIS-ACCEDE: A Video Database for Affective Content Analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55.
[3] L. Canini, S. Benini, and R. Leonardi. 2013. Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Technology 23, 4 (2013), 636–647.
[4] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton. 2013. Gtrace: General trace program compatible with EmotionML. In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII).
[5] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and C. Chamaret. 2016. The MediaEval 2016 Emotional Impact of Movies Task. In MediaEval 2016 Workshop.
[6] F. Eyben, F. Weninger, F. Gross, and B. Schuller. 2013. Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. In ACM Multimedia (MM), Barcelona, Spain.
[7] L.-A. Feldman. 1995. Valence focus and arousal focus: Individual differences in the structure of affective experience. 69 (1995), 153–166.
[8] A. Hanjalic. 2006. Extracting moods from pictures and sounds: Towards truly personalized TV. IEEE Signal Processing Magazine (2006).
[9] H. Katti, K. Yadati, M. Kankanhalli, and C. TatSeng. 2011. Affective video summarization and story board generation using pupillary dilation and eye gaze. In IEEE International Symposium on Multimedia (ISM).
[10] J. A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review (2003).
[11] R. R. Shah, Y. Yu, and R. Zimmermann. 2014. Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings. In ACM International Conference on Multimedia.
[12] K. Yadati, H. Katti, and M. Kankanhalli. 2014. Cavva: Computational affective video-in-video advertising. IEEE Transactions on Multimedia 16, 1 (2014), 15–23.
[13] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. 2010. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia 12, 6 (2010), 510–522.
[14] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu. 2013. Flexible presentation of videos based on affective content analysis. Advances in Multimedia Modeling 7732 (2013).