The MediaEval 2018 Emotional Impact of Movies Task

Emmanuel Dellandréa, emmanuel.dellandrea@ec-lyon.fr, Ecole Centrale de Lyon, France

Martijn Huigsloot, huigsloot@nicam.nl, NICAM, Netherlands

Liming Chen, liming.chen@ec-lyon.fr, Ecole Centrale de Lyon, France

Yoann Baveye, yoann.baveye@capacites.fr, Capacités, France

Zhongzhe Xiao, xiaozhongzhe@suda.edu.cn, Soochow University, China

Sjöberg, Aalto University, Finland


This paper provides a description of the MediaEval 2018 "Emotional Impact of Movies task". It continues to build on last year's edition, integrating the feedback of previous participants. The goal is to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence, arousal and fear. Here we provide a description of the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.

Copyright held by the owner/author(s).

INTRODUCTION

Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [3] or video indexing [12], and efficient movie visualization and browsing [13]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [8], or personalized soundtrack recommendation to make user-generated videos more attractive [10]. Affective techniques can also be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [11].

While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities such as computer vision and machine learning, with an overall goal of endowing computers with human-like perception capabilities. Thus, this task is proposed to offer researchers a place to compare their approaches for the prediction of the emotional impact of movies. It continues to build on last year's edition [5], integrating the feedback of participants. The task consists of two subtasks, the first one being related to valence and arousal prediction, and the second one to fear detection.

TASK DESCRIPTION

The task requires participants to deploy multimedia features and models to automatically predict the emotional impact of movies. This emotional impact is considered here to be the prediction of the expected emotion. The expected emotion is the emotion that the majority of the audience feels in response to the same content. In other words, the expected emotion is the expected value of experienced (i.e. induced) emotion in a population. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [7].

This year, two scenarios are proposed as subtasks. In both cases, long movies are considered.

(1) Valence and Arousal prediction: participants' systems have to predict a score of expected valence and arousal continuously (every second) along movies. Valence is defined on a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions [9];

(2) Fear detection: the purpose here is to predict beginning and ending times of sequences inducing fear in movies. The targeted use case is the detection of frightening scenes to help systems protect children from potentially harmful video content.

DATA DESCRIPTION

The dataset used in this task is the LIRIS-ACCEDE dataset (see footnote 1). It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution [2]. Several movie genres are represented in this collection, such as horror, comedy, drama and action. Languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English. A total of 44 movies (total duration of 15 hours and 20 minutes) selected from the set of 160 movies are provided as the development set for both subtasks, with annotations for fear, valence and arousal. A complementary set of 10 movies (11 hours and 29 minutes) is available for the first subtask with the valence and arousal annotations.

The test set consists of 12 other movies selected from the set of 160 movies, for a total duration of 8 hours and 56 minutes.

MediaEval'18, 29-31 October 2018, Sophia Antipolis, France. E. Dellandréa et al.

In addition to the video data, participants are also provided with general purpose audio and visual content features. To compute audio features, movies have first been processed to extract consecutive 5-second segments sliding over the whole movie with a shift of 1 second. Then, audio features have been extracted from these segments using the openSMILE toolbox [6]. The default configuration named "emobase2010.conf" was used. It allows the computation of 1,582 features, which result from a base of 34 low-level descriptors (LLD) with 34 corresponding delta coefficients appended, and 21 functionals applied to each of these 68 LLD contours (1,428 features). In addition, 19 functionals are applied to the 4 pitch-based LLD and their four delta coefficient contours (152 features). Finally, the number of pitch onsets (pseudo syllables) and the total duration of the input are appended (2 features).
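As a sanity check on the figures above, the following minimal sketch enumerates the 5-second windows and recomputes the feature total from the stated counts. The function names are hypothetical, and the handling of the end of a movie (only full-length windows are kept) is an assumption that the text does not specify.

```python
def segment_starts(duration_s, window_s=5, shift_s=1):
    """Start times (in seconds) of the full-length windows that fit in the movie.

    Assumption: windows that would run past the end of the movie are dropped.
    """
    if duration_s < window_s:
        return []
    return list(range(0, duration_s - window_s + 1, shift_s))


def emobase2010_feature_count():
    """Recompute the feature total from the counts given in the text."""
    lld = 34                      # base low-level descriptors
    contours = lld * 2            # LLDs plus their delta coefficients (68)
    general = contours * 21       # 21 functionals per contour (1,428)
    pitch_contours = 4 + 4        # pitch-based LLDs plus their deltas
    pitch = pitch_contours * 19   # 19 functionals per pitch contour (152)
    extras = 2                    # pitch onset count and total input duration
    return general + pitch + extras
```

Under these assumptions, a 10-second clip yields six windows starting at seconds 0 through 5, and the recomputed total matches the 1,582 features reported for the configuration.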

Beyond audio features, image frames were extracted from each movie at a rate of one per second. For each of these images, several general purpose visual features are provided. They were computed using the LIRE library (see footnote 3), except for the CNN features (VGG16 fc6 layer), which were extracted using the Matlab Neural Networks toolbox (see footnote 4).

GROUND TRUTH

As mentioned in the previous section, the development set contains a part that is common to both subtasks with valence, arousal and fear annotations (44 movies), and an additional part only concerning the first subtask, with valence and arousal annotations (10 movies).

For each movie from the development set for the first subtask, a file is provided containing valence and arousal values for each second of the movie.

Moreover, for all movies from the development set for the second subtask, a file is provided containing the beginning and ending times of each sequence in the movie inducing fear.

Ground Truth for the first subtask

In order to collect continuous valence and arousal annotations, a total of 28 French participants had to continuously indicate their level of valence and arousal while watching the movies, using a modified version of the GTrace annotation tool [4] and a joystick. Each annotator continuously annotated one subset of the movies considering the induced valence and another subset considering the induced arousal, for a total duration of around 8 hours over 2 days. Thus, each movie has been continuously annotated by three to five different annotators.

Then, the continuous valence and arousal annotations from the participants have been down-sampled by averaging the annotations over sliding windows of 10 seconds with a shift of 1 second (i.e., 1 value per second) in order to remove the noise due to unintended movements of the joystick. Finally, these post-processed continuous annotations have been averaged across annotators in order to create a continuous mean signal of the valence and arousal self-assessments, ranging from -1 (most negative for valence, most passive for arousal) to +1 (most positive for valence, most active for arousal). The details of this processing are given in [1].
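The two post-processing steps described above can be sketched as follows. This is an illustration only, not the procedure of [1]: the raw sampling rate `rate_hz`, the alignment of the windows (each window starts at a whole second and only full windows are kept) and the function names are all assumptions.

```python
def sliding_mean(samples, rate_hz, window_s=10, shift_s=1):
    """Average raw samples over windows of window_s seconds, moved by shift_s.

    Assumption: samples arrive at a fixed rate of rate_hz values per second.
    """
    win = int(window_s * rate_hz)
    step = int(shift_s * rate_hz)
    smoothed = []
    for start in range(0, len(samples) - win + 1, step):
        chunk = samples[start:start + win]
        smoothed.append(sum(chunk) / win)
    return smoothed


def mean_signal(per_annotator):
    """Average equally long smoothed traces across annotators, value by value."""
    n = len(per_annotator)
    return [sum(values) / n for values in zip(*per_annotator)]
```

Each annotator's trace would first be passed through `sliding_mean`, and the resulting per-second traces of the three to five annotators of a movie would then be combined with `mean_signal`.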

Ground Truth for the second subtask

Fear annotations for the second subtask were generated using a tool specifically designed for the classification of audio-visual media, which allows annotation to be performed while watching the movie. The annotations were made by two experienced team members of NICAM, both trained in the classification of media. Each movie was annotated by one annotator, who reported the start and stop times of each sequence in the movie expected to induce fear.

RUN DESCRIPTION

Participants can submit up to 5 runs for each of the two subtasks, so 10 runs in total. Models can rely on the features provided by the organizers or any other external data.

EVALUATION CRITERIA

Standard evaluation metrics are used to assess system performance. The first subtask can be considered as a regression problem (estimation of expected valence and arousal scores), while the second subtask can be seen as a binary classification problem (a video segment either is or is not expected to induce fear).

For the first subtask, the official metric is the Mean Square Error (MSE), the measure most commonly used to evaluate regression models. However, to allow a deeper understanding of system performance, we also consider Pearson's Correlation Coefficient. Indeed, MSE alone is not always sufficient to assess a model, and the correlation may be needed for a finer analysis. As an example, if a large portion of the data is neutral (i.e., its valence score is close to 0, the midpoint of the [-1, +1] annotation range) or is distributed around the neutral score, a constant model that always outputs the neutral score will achieve a low MSE. In this case, the model's lack of accuracy is exposed by the correlation between the predicted values and the ground truth, which will be very low (and is undefined for a strictly constant output).
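The interplay between the two metrics can be illustrated with a minimal sketch. The function names and the toy values are hypothetical, and returning NaN for a zero-variance input is a choice made here for illustration, not a rule stated by the task.

```python
import math


def mse(pred, truth):
    """Mean squared difference between predictions and ground truth."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def pearson(pred, truth):
    """Pearson's correlation coefficient; NaN when either input is constant."""
    n = len(truth)
    mean_p, mean_t = sum(pred) / n, sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0 or var_t == 0:
        return float("nan")  # correlation is undefined without variance
    return cov / math.sqrt(var_p * var_t)


# Toy ground truth clustered around the neutral value (assumed to be 0.0).
truth = [0.1, -0.1, 0.05, -0.05, 0.0, 0.1]
constant = [0.0] * len(truth)
```

With these toy values, `mse(constant, truth)` is small even though the constant model carries no information about the signal, while `pearson(constant, truth)` is NaN, which makes the failure visible.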

For the second subtask, as the goal is to detect time sequences inducing fear, the official metric is the Intersection over Union of time intervals.
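One possible way to compute an Intersection over Union on time intervals is sketched below. The text does not specify how sequences are matched or how scores are aggregated over movies, so this sketch makes its own assumptions: times are integer seconds, interval ends are exclusive, and all intervals of one movie are pooled before the ratio is taken. The function names are hypothetical.

```python
def to_seconds(intervals):
    """Expand (start, end) pairs into the set of covered seconds, end exclusive."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def interval_iou(predicted, truth):
    """Ratio of seconds flagged by both to seconds flagged by either."""
    pred_s, true_s = to_seconds(predicted), to_seconds(truth)
    union = pred_s | true_s
    if not union:
        # Assumption: nothing predicted and nothing annotated is a perfect match.
        return 1.0
    return len(pred_s & true_s) / len(union)
```

For instance, a predicted sequence (10, 20) against an annotated sequence (15, 25) shares 5 seconds out of 15 covered in total, giving an IoU of one third.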

CONCLUSION

The Emotional Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for emotion detection in movies, in terms of valence, arousal and fear. The LIRIS-ACCEDE dataset is used for both the development and test sets. Details on the methods and results of each individual team can be found in the papers of the participating teams in the MediaEval 2018 workshop proceedings.

The visual features provided are the following: Auto Color Correlogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, Joint descriptor joining CEDD and FCTH in one histogram, Scalable Color, Tamura, Local Binary Patterns, VGG16 fc6 layer.

Footnotes:
1. http://liris-accede.ec-lyon.fr
3. http://www.lire-project.net/
4. https://www.mathworks.com/products/neural-network.html

ACKNOWLEDGMENTS

This task is supported by the CHIST-ERA Visen project ANR-12-CHRI-0002-04.

REFERENCES

[1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos. In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII).

[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. LIRIS-ACCEDE: A Video Database for Affective Content Analysis. IEEE Transactions on Affective Computing 6, 1 (2015).

[3] L. Canini, S. Benini, and R. Leonardi. 2013. Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Technology 23, 4 (2013).

[4] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton. 2013. Gtrace: General trace program compatible with emotionml. In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII).

[5] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, and M. Sjöberg. 2017. The MediaEval 2017 Emotional Impact of Movies Task. In MediaEval 2017 Workshop.

[6] F. Eyben, F. Weninger, F. Gross, and B. Schuller. 2013. Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. In ACM Multimedia (MM), Barcelona, Spain.

[7] A. Hanjalic. 2006. Extracting moods from pictures and sounds: Towards truly personalized TV. IEEE Signal Processing Magazine (2006).

[8] H. Katti, K. Yadati, M. Kankanhalli, and C. Tat-Seng. 2011. Affective video summarization and story board generation using pupillary dilation and eye gaze. In IEEE International Symposium on Multimedia (ISM).

[9] J. A. Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review (2003).

[10] R. R. Shah, Y. Yu, and R. Zimmermann. 2014. Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings. In ACM International Conference on Multimedia.

[11] K. Yadati, H. Katti, and M. Kankanhalli. 2014. Cavva: Computational affective video-in-video advertising. IEEE Transactions on Multimedia 16, 1 (2014).

[12] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. 2010. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia 12, 6 (2010).

[13] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu. 2013. Flexible presentation of videos based on affective content analysis. Advances in Multimedia Modeling 7732 (2013).