=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_4
|storemode=property
|title=The MediaEval 2018 Emotional Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_4.pdf
|volume=Vol-2283
|authors=Emmanuel Dellandréa,Martijn Huigsloot,Liming Chen,Yoann Baveye,Zhongzhe Xiao,Mats Sjöberg
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DellandreaH0BXS18
}}
==The MediaEval 2018 Emotional Impact of Movies Task==
The MediaEval 2018 Emotional Impact of Movies Task
Emmanuel Dellandréa1, Martijn Huigsloot2, Liming Chen1, Yoann Baveye3, Zhongzhe Xiao4 and Mats Sjöberg5
1 Ecole Centrale de Lyon, France, {emmanuel.dellandrea, liming.chen}@ec-lyon.fr
2 NICAM, Netherlands, huigsloot@nicam.nl
3 Capacités, France, yoann.baveye@capacites.fr
4 Soochow University, China, xiaozhongzhe@suda.edu.cn
5 Aalto University, Finland, mats.sjoberg@aalto.fi
ABSTRACT
This paper provides a description of the MediaEval 2018 "Emotional Impact of Movies Task". It continues to build on last year's edition, integrating the feedback of previous participants. The goal is to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence, arousal and fear. Here we provide a description of the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.

1 INTRODUCTION
Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [3], video indexing [12], and efficient movie visualization and browsing [13]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [8], or personalized soundtrack recommendation to make user-generated videos more attractive [10]. Affective techniques can also be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [11].

While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with the overall goal of endowing computers with human-like perception capabilities. This task is therefore proposed to offer researchers a place to compare their approaches for the prediction of the emotional impact of movies. It continues to build on last year's edition [5], integrating the feedback of participants. The task consists of two subtasks, the first related to valence and arousal prediction, and the second to fear detection.

2 TASK DESCRIPTION
The task requires participants to deploy multimedia features and models to automatically predict the emotional impact of movies. This emotional impact is considered here to be the prediction of the expected emotion. The expected emotion is the emotion that the majority of the audience feels in response to the same content. In other words, the expected emotion is the expected value of the experienced (i.e., induced) emotion in a population. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [7].

This year, two scenarios are proposed as subtasks. In both cases, long movies are considered.
(1) Valence and Arousal prediction: participants' systems have to predict a score of expected valence and arousal continuously (every second) along movies. Valence is defined on a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions [9];
(2) Fear detection: the purpose here is to predict the beginning and ending times of sequences inducing fear in movies. The targeted use case is the detection of frightening scenes to help systems protect children from potentially harmful video content.
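To make the two prediction targets concrete, the following minimal sketch shows one possible in-memory representation of a system's outputs: per-second valence/arousal scores for the first subtask and fear time intervals for the second. The class and field names are illustrative assumptions, not the official run-submission format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValenceArousalPrediction:
    """Expected valence/arousal score for one 1-second step of a movie."""
    movie_id: str
    time_s: int      # second index from the start of the movie
    valence: float   # most negative ... most positive
    arousal: float   # calmest ... most active


@dataclass
class FearInterval:
    """A time span predicted to induce fear."""
    movie_id: str
    start_s: float   # beginning time in seconds
    end_s: float     # ending time in seconds


# Example outputs for a hypothetical movie
subtask1_run: List[ValenceArousalPrediction] = [
    ValenceArousalPrediction("movie_01", t, valence=0.1, arousal=-0.2) for t in range(3)
]
subtask2_run: List[FearInterval] = [FearInterval("movie_01", 125.0, 183.0)]
```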
3 DATA DESCRIPTION
The dataset used in this task is the LIRIS-ACCEDE dataset (http://liris-accede.ec-lyon.fr). It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution [2]. Several movie genres are represented in this collection, such as horror, comedy, drama and action. The languages are mainly English, with a smaller set of Italian, Spanish, French and other movies subtitled in English.

A total of 44 movies (total duration of 15 hours and 20 minutes) selected from the set of 160 movies are provided as the development set for both subtasks, with annotations for fear, valence and arousal. A complementary set of 10 movies (11 hours and 29 minutes) is available for the first subtask with valence and arousal annotations.

The test set consists of 12 other movies selected from the set of 160 movies, for a total duration of 8 hours and 56 minutes.

In addition to the video data, participants are also provided with general-purpose audio and visual content features. To compute the audio features, the movies have first been processed to extract consecutive 5-second segments sliding over the whole movie with a shift of 1 second. Audio features have then been extracted from these segments using the openSMILE toolbox (http://audeering.com/technology/opensmile/) [6], with the default configuration named "emobase2010.conf". This configuration computes 1,582 features, which result from a base of 34 low-level descriptors (LLD) with 34 corresponding delta coefficients appended, and 21 functionals applied to each of these 68 LLD contours (1,428 features). In addition, 19 functionals are applied to the 4 pitch-based LLD and their four delta coefficient contours (152 features). Finally, the number of pitch onsets (pseudo syllables) and the total duration of the input are appended (2 features).
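As an illustration of this audio pipeline, the sketch below cuts 5-second segments with a 1-second shift using ffmpeg and runs the openSMILE command-line extractor with the emobase2010 configuration on each segment. The file paths, the location of the SMILExtract binary and configuration file, and the assumed movie duration are assumptions; this is not the organizers' exact extraction script.

```python
import subprocess
from pathlib import Path

MOVIE = Path("movie_01.mp4")                   # hypothetical input movie
OUT_DIR = Path("audio_features")               # hypothetical output directory
CONFIG = "opensmile/config/emobase2010.conf"   # assumed location of the config file
SEGMENT_LEN = 5                                # seconds per segment
SHIFT = 1                                      # shift between consecutive segments
DURATION = 120                                 # assumed movie duration in seconds

OUT_DIR.mkdir(exist_ok=True)

for start in range(0, DURATION - SEGMENT_LEN + 1, SHIFT):
    wav = OUT_DIR / f"{MOVIE.stem}_{start:05d}.wav"
    csv = OUT_DIR / f"{MOVIE.stem}_{start:05d}.csv"
    # Extract a mono 5-second audio segment starting at `start` seconds.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(SEGMENT_LEN),
         "-i", str(MOVIE), "-vn", "-ac", "1", "-ar", "16000", str(wav)],
        check=True)
    # Run openSMILE with the emobase2010 configuration (1,582 features per segment).
    subprocess.run(
        ["SMILExtract", "-C", CONFIG, "-I", str(wav), "-O", str(csv)],
        check=True)
```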
Beyond audio features, for each movie, image frames were extracted every second. For each of these images, several general-purpose visual features have been provided. They have been computed using the LIRE library (http://www.lire-project.net/), except for the CNN features (VGG16 fc6 layer), which have been extracted using the Matlab Neural Networks toolbox (https://www.mathworks.com/products/neural-network.html). The visual features are the following: Auto Color Correlogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, Joint descriptor joining CEDD and FCTH in one histogram, Scalable Color, Tamura, Local Binary Patterns, and the VGG16 fc6 layer.
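For the CNN part of the visual features, a rough Python equivalent of the described processing (one frame per second, VGG16 fc6 activations) could look as follows. It uses OpenCV and the Keras VGG16 model, whose first fully-connected layer ("fc1") corresponds to the fc6 layer mentioned above; this is an illustrative substitute for the Matlab toolbox actually used by the organizers, not their pipeline.

```python
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# VGG16 truncated at the first fully-connected layer ("fc1" in Keras,
# i.e. the fc6 layer in the original VGG16 naming), giving 4096-d features.
base = VGG16(weights="imagenet")
fc6_model = Model(inputs=base.input, outputs=base.get_layer("fc1").output)


def vgg16_fc6_features(video_path: str) -> np.ndarray:
    """Extract one frame per second and return its VGG16 fc6 features."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    features, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % int(round(fps)) == 0:                  # keep one frame per second
            img = cv2.resize(frame[:, :, ::-1], (224, 224))   # BGR -> RGB, VGG16 input size
            x = preprocess_input(np.expand_dims(img.astype("float32"), axis=0))
            features.append(fc6_model.predict(x, verbose=0)[0])
        frame_idx += 1
    cap.release()
    return np.stack(features)
```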
4 GROUND TRUTH
As mentioned in the previous section, the development set contains a part that is common to both subtasks, with valence, arousal and fear annotations (44 movies), and an additional part concerning only the first subtask, with valence and arousal annotations (10 movies).

For each movie in the development set for the first subtask, a file is provided containing valence and arousal values for each second of the movie.

Moreover, for all movies in the development set for the second subtask, a file is provided containing the beginning and ending times of each sequence in the movie inducing fear.

4.1 Ground Truth for the first subtask
In order to collect continuous valence and arousal annotations, a total of 28 French participants had to continuously indicate their level of valence and arousal while watching the movies, using a modified version of the GTrace annotation tool [4] and a joystick. Each annotator continuously annotated one subset of the movies considering the induced valence and another subset considering the induced arousal, for a total duration of around 8 hours over 2 days. Thus, each movie has been continuously annotated by three to five different annotators.

Then, the continuous valence and arousal annotations from the participants have been down-sampled by averaging the annotations over windows of 10 seconds with a shift of 1 second (i.e., one value per second) in order to remove the noise due to unintended movements of the joystick. Finally, these post-processed continuous annotations have been averaged in order to create a continuous mean signal of the valence and arousal self-assessments, ranging from -1 (most negative for valence, most passive for arousal) to +1 (most positive for valence, most active for arousal). The details of this processing are given in [1].
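A minimal sketch of this post-processing, assuming each annotator's raw trace has already been resampled to one value per second, could look as follows: a 10-second moving average with a 1-second shift applied to each trace, followed by averaging across annotators. The window edge handling and the synthetic example data are simplifying assumptions, not the exact procedure described in [1].

```python
import numpy as np
from typing import List


def smooth_trace(trace: np.ndarray, window: int = 10) -> np.ndarray:
    """Average a 1 Hz annotation trace over `window`-second windows with a 1-second shift."""
    kernel = np.ones(window) / window
    return np.convolve(trace, kernel, mode="same")


def mean_annotation(traces: List[np.ndarray]) -> np.ndarray:
    """Smooth each annotator's trace, then average them into one mean signal in [-1, 1]."""
    smoothed = np.stack([smooth_trace(t) for t in traces])
    return smoothed.mean(axis=0)


# Example: three hypothetical annotators, 60-second movie, values in [-1, 1]
rng = np.random.default_rng(0)
annotators = [np.clip(rng.normal(0.0, 0.3, 60), -1, 1) for _ in range(3)]
valence_ground_truth = mean_annotation(annotators)  # one value per second
```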
4.2 Ground Truth for the second subtask
Fear annotations for the second subtask were generated using a tool specifically designed for the classification of audio-visual media, which allows annotation to be performed while watching the movie. The annotations have been made by two experienced team members of NICAM (http://www.kijkwijzer.nl/nicam), both trained in the classification of media. Each movie has been annotated by one annotator, who reported the start and stop times of each sequence in the movie expected to induce fear.

5 RUN DESCRIPTION
Participants can submit up to 5 runs for each of the two subtasks, so 10 runs in total. Models can rely on the features provided by the organizers or on any other external data.

6 EVALUATION CRITERIA
Standard evaluation metrics are used to assess the systems' performance. The first subtask can be considered a regression problem (estimation of expected valence and arousal scores), while the second subtask can be seen as a binary classification problem (the video segment is supposed to induce/not induce fear).

For the first subtask, the official metric is the Mean Square Error (MSE), which is the common measure generally used to evaluate regression models. However, to allow a deeper understanding of systems' performance, we also consider Pearson's Correlation Coefficient. Indeed, MSE alone is not always sufficient to analyze a model's quality, and the correlation may be required to obtain a deeper performance analysis. As an example, if a large portion of the data is neutral (i.e., its valence score is close to 0.5) or is distributed around the neutral score, a uniform model that always outputs 0.5 will result in good MSE performance (low MSE). In this case, the lack of accuracy of the model will be brought to the fore by the correlation between the predicted values and the ground truth, which will also be very low.

For the second subtask, as the goal is to detect time sequences inducing fear, the official metric is the Intersection over Union of time intervals.
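As a reference for these metrics, the following sketch computes MSE and Pearson's correlation for per-second valence/arousal predictions, and one plausible reading of the Intersection over Union for fear time intervals (total intersected duration divided by total union duration, approximated on a fixed time grid). The official scoring scripts may differ in details such as interval handling and rounding.

```python
import numpy as np


def mse(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean Square Error between predicted and ground-truth scores."""
    return float(np.mean((pred - truth) ** 2))


def pearson(pred: np.ndarray, truth: np.ndarray) -> float:
    """Pearson's correlation coefficient between prediction and ground truth."""
    return float(np.corrcoef(pred, truth)[0, 1])


def interval_iou(pred, truth, step: float = 1.0, duration: float = None) -> float:
    """IoU of time intervals, approximated on a fixed-step time grid.

    `pred` and `truth` are lists of (start, end) fear intervals in seconds.
    """
    if duration is None:
        duration = max(e for _, e in (list(pred) + list(truth)) or [(0.0, 0.0)])
    grid = np.arange(0.0, duration, step)
    in_pred = np.zeros_like(grid, dtype=bool)
    in_truth = np.zeros_like(grid, dtype=bool)
    for s, e in pred:
        in_pred |= (grid >= s) & (grid < e)
    for s, e in truth:
        in_truth |= (grid >= s) & (grid < e)
    union = np.logical_or(in_pred, in_truth).sum()
    return float(np.logical_and(in_pred, in_truth).sum() / union) if union else 0.0


# Example usage with hypothetical predictions
print(mse(np.array([0.1, 0.4]), np.array([0.0, 0.5])))
print(pearson(np.array([0.1, 0.4, 0.2]), np.array([0.0, 0.5, 0.1])))
print(interval_iou([(120, 180)], [(130, 200)]))
```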
7 CONCLUSION
The Emotional Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for emotion detection in movies, in terms of valence, arousal and fear. The LIRIS-ACCEDE dataset has been used to build the development and test sets. Details on the methods and results of each individual team can be found in the papers of the participating teams in the MediaEval 2018 workshop proceedings.

ACKNOWLEDGMENTS
This task is supported by the CHIST-ERA Visen project ANR-12-CHRI-0002-04.
REFERENCES
[1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos. In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII).
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. LIRIS-
ACCEDE: A Video Database for Affective Content Analysis. IEEE
Transactions on Affective Computing 6, 1 (2015), 43–55.
[3] L. Canini, S. Benini, and R. Leonardi. 2013. Affective recommendation
of movies based on selected connotative features. IEEE Transactions
on Circuits and Systems for Video Technology 23, 4 (2013), 636–647.
[4] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton.
2013. GTrace: General trace program compatible with EmotionML. In
Humaine Association Conference on Affective Computing and Intelligent
Interaction (ACII).
[5] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, and M. Sjöberg. 2017.
The MediaEval 2017 Emotional Impact of Movies Task. In MediaEval
2017 Workshop.
[6] F. Eyben, F. Weninger, F. Gross, and B. Schuller. 2013. Recent Develop-
ments in openSMILE, the Munich Open-Source Multimedia Feature
Extractor. In ACM Multimedia (MM), Barcelona, Spain.
[7] A. Hanjalic. 2006. Extracting moods from pictures and sounds: To-
wards truly personalized TV. IEEE Signal Processing Magazine (2006).
[8] H. Katti, K. Yadati, M. Kankanhalli, and C. TatSeng. 2011. Affective
video summarization and story board generation using pupillary di-
lation and eye gaze. In IEEE International Symposium on Multimedia
(ISM).
[9] J. A. Russell. 2003. Core affect and the psychological construction of
emotion. Psychological Review (2003).
[10] R. R. Shah, Y. Yu, and R. Zimmermann. 2014. Advisor: Personalized
video soundtrack recommendation by late fusion with heuristic rank-
ings. In ACM International Conference on Multimedia.
[11] K. Yadati, H. Katti, and M. Kankanhalli. 2014. Cavva: Computational
affective video-in-video advertising. IEEE Transactions on Multimedia
16, 1 (2014), 15–23.
[12] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. 2010. Affective
visualization and retrieval for music video. IEEE Transactions on
Multimedia 12, 6 (2010), 510–522.
[13] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu. 2013. Flexible presentation
of videos based on affective content analysis. Advances in Multimedia
Modeling 7732 (2013).