The MediaEval 2016 Emotional Impact of Movies Task

Emmanuel Dellandréa1, Liming Chen1, Yoann Baveye2, Mats Sjöberg3 and Christel Chamaret4
1 Ecole Centrale de Lyon, France, {emmanuel.dellandrea, liming.chen}@ec-lyon.fr
2 Université de Nantes, France, yoann.baveye@univ-nantes.fr
3 HIIT, University of Helsinki, Finland, mats.sjoberg@helsinki.fi
4 Technicolor, France, christel.chamaret@technicolor.com

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper provides a description of the MediaEval 2016 "Emotional Impact of Movies" task. It builds on previous years' editions of the Affect in Multimedia Task: Violent Scenes Detection. However, in this year's task, participants are expected to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence and arousal scores. Here we provide insights on the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.

1. INTRODUCTION
Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [5], video indexing [12], and efficient movie visualization and browsing [13]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [7] or personalized soundtrack recommendation to make user-generated videos more attractive [9]. Affective techniques can also be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [11].
While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with the overall goal of endowing computers with human-like perception capabilities. This task is therefore proposed to offer researchers a place to compare their approaches for the prediction of the emotional impact of movies. It builds on previous years' editions of the Affect in Multimedia Task: Violent Scenes Detection [10].

2. TASK DESCRIPTION
The task requires participants to deploy multimedia features to automatically predict the emotional impact of movies. We focus on felt emotion, i.e., the actual emotion of the viewer when watching the video, rather than, for example, what the viewer believes he or she is expected to feel. Emotion is considered in terms of valence and arousal [8]. Valence is defined on a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions. Two subtasks are considered (a toy sketch contrasting their output granularities follows the list):

1. Global emotion prediction: given a short video clip (around 10 seconds), participants' systems are expected to predict a score of induced valence (negative-positive) and induced arousal (calm-excited) for the whole clip;

2. Continuous emotion prediction: as an emotion felt during a scene may be influenced by the emotions felt during the previous ones, the purpose here is to consider longer videos and to predict valence and arousal continuously along the video. Thus, a score of induced valence and arousal should be provided for each 1-second segment of the video.
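To make the two output granularities concrete, here is a minimal Python sketch; everything in it (variable names, example values, and the averaging rule relating the two subtasks) is an illustrative assumption rather than part of the official task definition.

```python
from statistics import mean

# Toy sketch of the two output granularities (illustrative only; the values
# and the averaging rule below are assumptions, not the official task spec).

# Subtask 2 (continuous emotion prediction): one (valence, arousal) pair is
# expected for every 1-second segment of a long movie.
per_second_predictions = [
    (0.42, 0.55),  # second 0
    (0.44, 0.61),  # second 1
    (0.47, 0.66),  # second 2
    # ... one pair per second of the movie
]

# Subtask 1 (global emotion prediction): a single (valence, arousal) pair is
# expected for a whole ~10-second clip. One trivial way to relate the two
# granularities is to average per-second scores over the clip; participants
# are of course free to model the clip directly instead.
def clip_level_score(per_second_scores):
    """Average per-second (valence, arousal) pairs into one clip-level pair."""
    valences, arousals = zip(*per_second_scores)
    return mean(valences), mean(arousals)

print(clip_level_score(per_second_predictions))  # e.g. (0.443..., 0.606...)
```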
3. DATA DESCRIPTION
The development dataset used in this task is the LIRIS-ACCEDE dataset (liris-accede.ec-lyon.fr) [3]. It is composed of two subsets. The first one, used for the first subtask (global emotion prediction), contains 9,800 video clips extracted from 160 professionally made and amateur movies of various genres, shared under Creative Commons licenses that allow the videos to be freely used and distributed without copyright issues as long as the original creator is credited. The segmented video clips last between 8 and 12 seconds and are representative enough to conduct experiments: the extracted segments are long enough to form consistent excerpts that allow the viewer to feel emotions, yet short enough that only one emotion is felt per excerpt. A robust shot and fade-in/out detection has been implemented to make sure that each extracted video clip starts and ends with a shot or a fade. Several movie genres are represented in this collection, such as horror, comedy, drama and action. Languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English.
The second part of the LIRIS-ACCEDE dataset is used for the second subtask (continuous emotion prediction). It consists of a selection of movies among the 160 used to extract the 9,800 video clips mentioned previously. The total length of the selected movies was the only constraint: it had to be smaller than eight hours to keep the annotation experiment at an acceptable duration. The selection process ended with the choice of 30 movies whose genre, content, language and duration are diverse enough to be representative of the original LIRIS-ACCEDE dataset. The selected videos are between 117 and 4,566 seconds long (mean = 884.2 s, SD = 766.7 s). The total length of the 30 selected movies is 7 hours, 22 minutes and 5 seconds.
In addition to the development set, a test set is also provided to assess the performance of participants' methods. 49 new movies under Creative Commons licenses have been considered. With the same protocol as the one used for the development set, 1,200 additional short video clips (between 8 and 12 seconds) have been extracted for the first subtask, and 10 long movies (from 25 minutes to 1 hour and 35 minutes, for a total duration of 11.48 hours) have been selected for the second subtask.
In solving the task, participants are expected to exploit the provided resources. The use of external resources (e.g., Internet data) is however allowed in specific runs.
Along with the video material and the annotations, features extracted from each video clip are also provided by the organizers for the first subtask. They correspond to the audiovisual features described in [3].

4. GROUND TRUTH

4.1 Ground Truth for the first subtask
The 9,800 video clips included in the first part of the LIRIS-ACCEDE dataset are ranked along the felt valence and arousal axes using a crowdsourcing protocol [3]. To keep the annotation task as simple and reliable as possible, pairwise comparisons were generated using the quicksort algorithm and presented to crowdworkers, who had to select the video inducing the calmest emotion or the most positive emotion.
To cross-validate the annotations gathered from various uncontrolled environments through crowdsourcing, another experiment was created to collect ratings for a subset of the database in a controlled environment. In this controlled experiment, 28 volunteers were asked to rate a carefully selected subset of the database using the 5-point discrete Self-Assessment Manikin scales for valence and arousal [4]. 20 regularly distributed excerpts per axis were selected, enough to represent the whole database while remaining few enough to keep the experiment at an acceptable duration.
From the original ranks and these ratings, absolute affective scores for valence and arousal have been estimated for each of the 9,800 video clips using Gaussian process regression models, as described in [1].
To obtain ground truth for the test subset, each of the 1,200 additional video clips has first been ranked with respect to the 9,800 video clips from the original dataset. Then, its valence and arousal ranks have been converted into valence and arousal scores using the regression models mentioned previously.

4.2 Ground Truth for the second subtask
In order to collect continuous valence and arousal annotations, 16 French participants continuously indicated their level of induced valence or arousal while watching the movies, using a modified version of the GTrace annotation tool [6] and a joystick (10 participants for the development set and 6 for the test set). The movies were divided into two subsets, and each annotator continuously annotated one subset along the induced valence axis and the other along the induced arousal axis. Thus, each movie has been continuously annotated by five annotators for the development set, and three for the test set.
Then, the continuous valence and arousal annotations from the participants have been down-sampled by averaging the annotations over windows of 10 seconds with 1-second overlap (i.e., 1 value per second) in order to remove the noise due to unintended moves of the joystick. Finally, these post-processed continuous annotations have been averaged to create a continuous mean signal of the valence and arousal self-assessments. The details of this processing are given in [2].
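As an illustration of this kind of post-processing, the sketch below assumes the joystick trace is sampled at a fixed rate (25 Hz here) and that the 10-second averaging window is centred on each second; these parameters, like the function names, are assumptions made for the example, since the exact processing is the one detailed in [2].

```python
import numpy as np

def downsample(annotation, fps=25, window_s=10):
    """Average a raw per-frame annotation over 10-second windows,
    producing one smoothed value per second of video."""
    n_seconds = len(annotation) // fps
    half = window_s // 2
    values = []
    for t in range(n_seconds):
        start = max(0, (t - half) * fps)
        end = min(len(annotation), (t + half) * fps)
        values.append(annotation[start:end].mean())
    return np.array(values)

def mean_signal(annotations, fps=25):
    """Average the smoothed signals of several annotators into one
    continuous mean signal (one value per second)."""
    smoothed = [downsample(np.asarray(a, dtype=float), fps) for a in annotations]
    length = min(len(s) for s in smoothed)
    return np.mean([s[:length] for s in smoothed], axis=0)

# Example: five annotators, two minutes of noisy joystick valence input.
rng = np.random.default_rng(0)
raw = [np.clip(0.5 + 0.1 * rng.standard_normal(120 * 25), 0, 1) for _ in range(5)]
print(mean_signal(raw).shape)  # (120,) -> one value per second
```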
5. RUN DESCRIPTION
Participants can submit up to 5 runs for the first subtask (global emotion prediction). For the second subtask (continuous emotion prediction), two types of run submissions are possible: full runs, which concern the whole test set (the 10 movies, total duration: 11.48 hours), and light runs, which concern a subset of the test set (5 movies, total duration: 4.82 hours). In each case (light and full), up to 5 runs can be submitted. Moreover, each subtask has a required run in which no external training data may be used; only the provided development data is allowed, along with any features that can be automatically extracted from the video. Both subtasks also offer the possibility of optional runs in which any external data can be used, such as Internet sources, as long as they are marked as "external data" runs.

6. EVALUATION CRITERIA
Standard evaluation metrics, the Mean Square Error (MSE) and Pearson's Correlation Coefficient, are used to assess system performance. The MSE is the measure most commonly used to evaluate regression models. However, it is not always sufficient to analyze a model's efficiency, and the correlation may be required to obtain a deeper performance analysis. As an example, if a large portion of the data is neutral (i.e., its valence score is close to 0.5) or is distributed around the neutral score, a uniform model that always outputs 0.5 will achieve a good (low) MSE. In this case, the lack of accuracy of the model will be brought to the fore by the correlation between the predicted values and the ground truth, which will also be very low.
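The following short sketch on synthetic data (the score distribution and the two toy models are invented for illustration) shows this effect: both models reach a low MSE, but only the informative one obtains a meaningful Pearson correlation.

```python
import numpy as np

# Illustration of why a low MSE alone can be misleading when most
# ground-truth values sit near the neutral score (0.5 here, by assumption).
rng = np.random.default_rng(0)
ground_truth = np.clip(0.5 + 0.05 * rng.standard_normal(1000), 0.0, 1.0)

# An informative model that roughly follows the signal, and a near-constant
# "always neutral" model (tiny independent jitter keeps the correlation defined).
informative = ground_truth + 0.03 * rng.standard_normal(1000)
neutral = 0.5 + 0.001 * rng.standard_normal(1000)

for name, pred in [("informative", informative), ("always ~0.5", neutral)]:
    mse = np.mean((pred - ground_truth) ** 2)
    r = np.corrcoef(pred, ground_truth)[0, 1]
    print(f"{name:12s}  MSE={mse:.4f}  Pearson r={r:+.3f}")
```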
7. CONCLUSIONS
The Emotional Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for emotion detection in movies, in terms of valence and arousal scores. The LIRIS-ACCEDE dataset (http://liris-accede.ec-lyon.fr) has been used as the development set, and additional movies under Creative Commons licenses, together with ground truth annotations, have been provided as the test set. Details on the methods and results of each individual team can be found in the papers of the participating teams in the MediaEval 2016 workshop proceedings.

8. ACKNOWLEDGMENTS
This task is supported by the CHIST-ERA Visen project ANR-12-CHRI-0002-04.

9. REFERENCES
[1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. From crowdsourced rankings to affective ratings. In IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2014.
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
[3] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 2015.
[4] M. M. Bradley and P. J. Lang. Measuring emotion: The Self-Assessment Manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 1994.
[5] L. Canini, S. Benini, and R. Leonardi. Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Technology, 2013.
[6] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton. GTrace: General trace program compatible with EmotionML. In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2013.
[7] H. Katti, K. Yadati, M. Kankanhalli, and C. TatSeng. Affective video summarization and story board generation using pupillary dilation and eye gaze. In IEEE International Symposium on Multimedia (ISM), 2011.
[8] J. A. Russell. Core affect and the psychological construction of emotion. Psychological Review, 2003.
[9] R. R. Shah, Y. Yu, and R. Zimmermann. ADVISOR: Personalized video soundtrack recommendation by late fusion with heuristic rankings. In ACM International Conference on Multimedia, 2014.
[10] M. Sjöberg, Y. Baveye, H. Wang, V. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, 2015.
[11] K. Yadati, H. Katti, and M. Kankanhalli. CAVVA: Computational affective video-in-video advertising. IEEE Transactions on Multimedia, 2014.
[12] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia, 2010.
[13] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu. Flexible presentation of videos based on affective content analysis. Advances in Multimedia Modeling, 2013.