              The MediaEval 2018 Emotional Impact of Movies Task
      Emmanuel Dellandréa1 , Martijn Huigsloot2 , Liming Chen1 , Yoann Baveye3 , Zhongzhe Xiao4 and
                                             Mats Sjöberg5
                               1 Ecole Centrale de Lyon, France, {emmanuel.dellandrea, liming.chen}@ec-lyon.fr
                                                             2 NICAM, Netherlands, huigsloot@nicam.nl
                                                        3 Capacités, France, yoann.baveye@capacites.fr
                                                4 Soochow University, China, xiaozhongzhe@suda.edu.cn
                                                      5 Aalto University, Finland, mats.sjoberg@aalto.fi


ABSTRACT
This paper provides a description of the MediaEval 2018 “Emotional Impact of Movies” task. It continues to build on last year’s edition, integrating the feedback of previous participants. The goal is to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence, arousal and fear. Here we provide a description of the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.

1    INTRODUCTION
Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [3], video indexing [12], and efficient movie visualization and browsing [13]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., for movie summarization [8] or personalized soundtrack recommendation to make user-generated videos more attractive [10]. Affective techniques can also be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [11].
   While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with the overall goal of endowing computers with human-like perception capabilities. This task is therefore proposed to offer researchers a place to compare their approaches for the prediction of the emotional impact of movies. It continues to build on last year’s edition [5], integrating the feedback of participants. The task consists of two subtasks, the first related to valence and arousal prediction, and the second to fear detection.

2    TASK DESCRIPTION
The task requires participants to deploy multimedia features and models to automatically predict the emotional impact of movies. This emotional impact is considered here to be the prediction of the expected emotion. The expected emotion is the emotion that the majority of the audience feels in response to the same content. In other words, the expected emotion is the expected value of the experienced (i.e., induced) emotion in a population. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [7].
   This year, two scenarios are proposed as subtasks. In both cases, long movies are considered.
     (1) Valence and arousal prediction: participants’ systems have to predict a score of expected valence and arousal continuously (every second) along movies. Valence is defined on a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions [9];
     (2) Fear detection: the purpose here is to predict the beginning and ending times of sequences inducing fear in movies. The targeted use case is the detection of frightening scenes, to help systems protect children from potentially harmful video content.

3    DATA DESCRIPTION
The dataset used in this task is the LIRIS-ACCEDE dataset1. It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution [2]. Several movie genres are represented in this collection, such as horror, comedy, drama and action. Languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English.
   A total of 44 movies (total duration of 15 hours and 20 minutes) selected from the set of 160 movies are provided as the development set for both subtasks, with annotations for fear, valence and arousal. A complementary set of 10 movies (11 hours and 29 minutes) is available for the first subtask, with valence and arousal annotations.
   The test set consists of 12 other movies selected from the set of 160 movies, for a total duration of 8 hours and 56 minutes.
   In addition to the video data, participants are also provided with general-purpose audio and visual content features. To compute the audio features, movies have first been processed to extract consecutive 5-second segments sliding over the whole movie with a shift of 1 second. Then, audio features have been extracted from these segments using the openSMILE toolbox2 [6]. The default configuration named “emobase2010.conf” was used. It allows the computation of
Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

1 http://liris-accede.ec-lyon.fr
2 http://audeering.com/technology/opensmile/


1,582 features, which result from a base of 34 low-level descriptors (LLD) with 34 corresponding delta coefficients appended, and 21 functionals applied to each of these 68 LLD contours (1,428 features). In addition, 19 functionals are applied to the 4 pitch-based LLD and their four delta coefficient contours (152 features). Finally, the number of pitch onsets (pseudo-syllables) and the total duration of the input are appended (2 features).
   Beyond audio features, for each movie, image frames were extracted every second. For each of these images, several general-purpose visual features have been provided. They have been computed using the LIRE library3, except the CNN features (VGG16 fc6 layer), which have been extracted using the Matlab Neural Networks toolbox4. The visual features are the following: Auto Color Correlogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, Joint descriptor joining CEDD and FCTH in one histogram, Scalable Color, Tamura, Local Binary Patterns, and VGG16 fc6 layer.

4     GROUND TRUTH
As mentioned in the previous section, the development set contains a part that is common to both subtasks, with valence, arousal and fear annotations (44 movies), and an additional part concerning only the first subtask, with valence and arousal annotations (10 movies).
   For each movie from the development set for the first subtask, a file is provided containing valence and arousal values for each second of the movie.
   Moreover, for all movies from the development set for the second subtask, a file is provided containing the beginning and ending times of each sequence in the movie inducing fear.

4.1      Ground Truth for the first subtask
In order to collect continuous valence and arousal annotations, a total of 28 French participants had to continuously indicate their level of valence and arousal while watching the movies, using a modified version of the GTrace annotation tool [4] and a joystick. Each annotator continuously annotated one subset of the movies considering the induced valence and another subset considering the induced arousal, for a total duration of around 8 hours over 2 days. Thus, each movie has been continuously annotated by three to five different annotators.
   Then, the continuous valence and arousal annotations from the participants have been down-sampled by averaging the annotations over windows of 10 seconds with a shift of 1 second (i.e., 1 value per second), in order to remove the noise due to unintended movements of the joystick. Finally, these post-processed continuous annotations have been averaged in order to create a continuous mean signal of the valence and arousal self-assessments, ranging from -1 (most negative for valence, most passive for arousal) to +1 (most positive for valence, most active for arousal). The details of this processing are given in [1].

4.2     Ground Truth for the second subtask
Fear annotations for the second subtask were generated using a tool specifically designed for the classification of audio-visual media, which allows annotation to be performed while watching the movie. The annotations have been made by two experienced team members of NICAM5, both of them trained in the classification of media. Each movie has been annotated by one annotator, who reported the start and stop times of each sequence in the movie expected to induce fear.

5     RUN DESCRIPTION
Participants can submit up to 5 runs for each of the two subtasks, i.e., 10 runs in total. Models can rely on the features provided by the organizers or on any other external data.

6     EVALUATION CRITERIA
Standard evaluation metrics are used to assess systems’ performance. The first subtask can be considered as a regression problem (estimation of expected valence and arousal scores), while the second subtask can be seen as a binary classification problem (a video segment is supposed to induce/not induce fear).
   For the first subtask, the official metric is the Mean Square Error (MSE), which is the measure commonly used to evaluate regression models. However, to allow a deeper understanding of systems’ performance, we also consider Pearson’s correlation coefficient. Indeed, MSE is not always sufficient to analyze a model’s efficiency, and the correlation may be required for a deeper performance analysis. For example, if a large portion of the data is neutral (i.e., its valence score is close to 0.5) or is distributed around the neutral score, a uniform model that always outputs 0.5 will achieve a good (low) MSE. In this case, the lack of accuracy of the model will be brought to the fore by the correlation between the predicted values and the ground truth, which will also be very low.
   For the second subtask, as the goal is to detect time sequences inducing fear, the official metric is the Intersection over Union (IoU) of time intervals.

7     CONCLUSION
The Emotional Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for emotion detection in movies, in terms of valence, arousal and fear. The LIRIS-ACCEDE dataset has been used for the development and test sets. Details on the methods and results of each individual team can be found in the papers of the participating teams in the MediaEval 2018 workshop proceedings.

ACKNOWLEDGMENTS
This task is supported by the CHIST-ERA Visen project ANR-12-CHRI-0002-04.
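As an illustration of why MSE alone can be misleading, a small sketch can compare the uniform 0.5 baseline mentioned in the evaluation criteria with a noisy but correlated predictor. The data and helper functions below are hypothetical, written only to demonstrate the point; they are not part of the task's evaluation code:

```python
import math
import random

def mse(pred, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def pearson(pred, truth):
    """Pearson correlation; returns nan when either sequence is constant."""
    n = len(truth)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    if sp == 0 or st == 0:
        return float("nan")  # constant predictor: correlation is undefined
    return cov / (sp * st)

random.seed(0)
# Hypothetical ground-truth valence scores clustered around the neutral value 0.5.
truth = [0.5 + random.gauss(0, 0.05) for _ in range(1000)]

constant = [0.5] * len(truth)                          # uniform baseline
tracking = [t + random.gauss(0, 0.05) for t in truth]  # noisy but correlated

print(mse(constant, truth), pearson(constant, truth))  # low MSE, undefined correlation
print(mse(tracking, truth), pearson(tracking, truth))  # similar MSE, clearly positive correlation
```

The constant baseline reaches roughly the same MSE as the tracking predictor on this near-neutral data, yet its correlation with the ground truth is undefined (and near zero for any almost-constant output), which is exactly why the task reports both metrics.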

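The Intersection over Union of time intervals used for the second subtask can be sketched as follows. The 1-second sampling of the timeline and the (start, end) interval format are assumptions of this illustration, matching the task's per-second granularity but not necessarily the organizers' exact implementation:

```python
def interval_iou(pred, truth):
    """Intersection over Union of two lists of (start, end) time intervals,
    measured along the time axis (times in seconds)."""
    def covered(intervals, t):
        return any(s <= t < e for s, e in intervals)
    # Sample the timeline at 1-second resolution (an assumption of this sketch).
    horizon = int(max(e for _, e in pred + truth))
    inter = sum(1 for t in range(horizon) if covered(pred, t) and covered(truth, t))
    union = sum(1 for t in range(horizon) if covered(pred, t) or covered(truth, t))
    return inter / union if union else 1.0

# A predicted fear sequence [10, 30) against a ground-truth sequence [20, 40):
# the overlap is 10 s out of a 30 s union, i.e. IoU = 1/3.
print(interval_iou([(10, 30)], [(20, 40)]))
```

A perfect detection yields an IoU of 1, while disjoint predicted and annotated sequences yield 0.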
3 http://www.lire-project.net/
4 https://www.mathworks.com/products/neural-network.html
5 http://www.kijkwijzer.nl/nicam

REFERENCES
 [1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. Deep
     Learning vs. Kernel Methods: Performance for Emotion Prediction


     in Videos. In Humaine Association Conference on Affective Computing
     and Intelligent Interaction (ACII).
 [2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. LIRIS-
     ACCEDE: A Video Database for Affective Content Analysis. IEEE
     Transactions on Affective Computing 6, 1 (2015), 43–55.
 [3] L. Canini, S. Benini, and R. Leonardi. 2013. Affective recommendation
     of movies based on selected connotative features. IEEE Transactions
     on Circuits and Systems for Video Technology 23, 4 (2013), 636–647.
 [4] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton.
     2013. GTrace: General trace program compatible with EmotionML. In
     Humaine Association Conference on Affective Computing and Intelligent
     Interaction (ACII).
 [5] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, and M. Sjöberg. 2017.
     The MediaEval 2017 Emotional Impact of Movies Task. In MediaEval
     2017 Workshop.
 [6] F. Eyben, F. Weninger, F. Gross, and B. Schuller. 2013. Recent Develop-
     ments in openSMILE, the Munich Open-Source Multimedia Feature
     Extractor. In ACM Multimedia (MM), Barcelona, Spain.
 [7] A. Hanjalic. 2006. Extracting moods from pictures and sounds: To-
     wards truly personalized TV. IEEE Signal Processing Magazine (2006).
 [8] H. Katti, K. Yadati, M. Kankanhalli, and C. TatSeng. 2011. Affective
     video summarization and story board generation using pupillary di-
     lation and eye gaze. In IEEE International Symposium on Multimedia
     (ISM).
 [9] J. A. Russell. 2003. Core affect and the psychological construction of
     emotion. Psychological Review (2003).
[10] R. R. Shah, Y. Yu, and R. Zimmermann. 2014. Advisor: Personalized
     video soundtrack recommendation by late fusion with heuristic rank-
     ings. In ACM International Conference on Multimedia.
[11] K. Yadati, H. Katti, and M. Kankanhalli. 2014. Cavva: Computational
     affective video-in-video advertising. IEEE Transactions on Multimedia
     16, 1 (2014), 15–23.
[12] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. 2010. Affective
     visualization and retrieval for music video. IEEE Transactions on
     Multimedia 12, 6 (2010), 510–522.
[13] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu. 2013. Flexible presentation
     of videos based on affective content analysis. Advances in Multimedia
     Modeling 7732 (2013).