      The MediaEval 2016 Emotional Impact of Movies Task

Emmanuel Dellandréa 1, Liming Chen 1, Yoann Baveye 2, Mats Sjöberg 3 and Christel Chamaret 4

1 Ecole Centrale de Lyon, France, {emmanuel.dellandrea, liming.chen}@ec-lyon.fr
2 Université de Nantes, France, yoann.baveye@univ-nantes.fr
3 HIIT, University of Helsinki, Finland, mats.sjoberg@helsinki.fi
4 Technicolor, France, christel.chamaret@technicolor.com

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.



ABSTRACT
This paper provides a description of the MediaEval 2016 "Emotional Impact of Movies" task. It builds on previous years' editions of the Affect in Multimedia Task: Violent Scenes Detection. However, in this year's task, participants are expected to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence and arousal scores. Here we provide insights on the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.

1. INTRODUCTION
Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [5] or video indexing [12], and efficient movie visualization and browsing [13]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [7], or personalized soundtrack recommendation to make user-generated videos more attractive [9]. Affective techniques can also be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [11].
While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with the overall goal of endowing computers with human-like perception capabilities. This task is therefore proposed to offer researchers a place to compare their approaches for the prediction of the emotional impact of movies. It builds on previous years' editions of the Affect in Multimedia Task: Violent Scenes Detection [10].

2. TASK DESCRIPTION
The task requires participants to deploy multimedia features to automatically predict the emotional impact of movies. We focus on felt emotion, i.e., the actual emotion of the viewer when watching the video, rather than, for example, what the viewer believes he or she is expected to feel. The emotion is considered in terms of valence and arousal [8]. Valence is defined as a continuous scale from the most negative to the most positive emotions, while arousal is defined continuously from the calmest to the most active emotions. Two subtasks are considered:

  1. Global emotion prediction: given a short video clip (around 10 seconds), participants' systems are expected to predict a score of induced valence (negative-positive) and induced arousal (calm-excited) for the whole clip;

  2. Continuous emotion prediction: as an emotion felt during a scene may be influenced by the emotions felt during the previous ones, the purpose here is to consider longer videos and to predict the valence and arousal continuously along the video. Thus, a score of induced valence and arousal should be provided for each 1-second segment of the video.

3. DATA DESCRIPTION
The development dataset used in this task is the LIRIS-ACCEDE dataset (liris-accede.ec-lyon.fr) [3]. It is composed of two subsets. The first one, used for the first subtask (global emotion prediction), contains 9,800 video clips extracted from 160 professionally made and amateur movies of various genres, shared under Creative Commons licenses that allow the videos to be freely used and distributed without copyright issues as long as the original creator is credited. The segmented video clips last between 8 and 12 seconds and are representative enough to conduct experiments: the length of the extracted segments is large enough to obtain consistent excerpts allowing the viewer to feel emotions, and small enough for the viewer to feel only one emotion per excerpt. A robust shot and fade in/out detection has been implemented to make sure that each extracted video clip starts and ends with a shot or a fade. Several movie genres are represented in this collection, such as horror, comedy, drama and action. Languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English.
The second part of the LIRIS-ACCEDE dataset is used for the second subtask (continuous emotion prediction). It consists of a selection of movies among the 160 used to extract the 9,800 video clips mentioned previously. The total length of the selected movies was the only constraint: it had to be smaller than eight hours to create an experiment of acceptable duration. The selection process ended with the choice of 30 movies whose genre, content, language and duration are diverse enough to be representative of the original LIRIS-ACCEDE dataset. The selected videos are between 117 and 4,566 seconds long (mean = 884.2 s, SD = 766.7 s). The total length of the 30 selected movies is 7 hours, 22 minutes and 5 seconds.
In addition to the development set, a test set is also provided to assess the performance of participants' methods. 49 new movies under Creative Commons licenses have been considered. With the same protocol as the one used for the development set, 1,200 additional short video clips (between 8 and 12 seconds) have been extracted for the first subtask, and 10 long movies (from 25 minutes to 1 hour and 35 minutes) have been selected for the second subtask (for a total duration of 11.48 hours).
In solving the task, participants are expected to exploit the provided resources. Use of external resources (e.g., Internet data) is, however, allowed in specific runs.
Along with the video material and the annotations, features extracted from each video clip are also provided by the organizers for the first subtask. They correspond to the audiovisual features described in [3].
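
As a purely illustrative example (not an official baseline), the following sketch shows how a participant might regress valence from per-clip features with a support vector machine; the feature matrix, the ground-truth vector and their loading are placeholders, not the actual format of the files distributed by the organizers.

    # Hypothetical baseline for the global emotion prediction subtask:
    # regress valence (and, analogously, arousal) from per-clip audiovisual
    # features with an SVR. X and y are random stand-ins for the provided
    # features and ground truth (the real development set has 9,800 clips).
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 50))   # placeholder feature matrix
    y = rng.uniform(0.0, 1.0, size=1000)  # placeholder valence scores

    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print("cross-validated MSE on the placeholder data:", mse)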

4. GROUND TRUTH

4.1 Ground Truth for the first subtask
The 9,800 video clips included in the first part of the LIRIS-ACCEDE dataset are ranked along the felt valence and arousal axes using a crowdsourcing protocol [3]. To make reliable annotation as simple as possible, pairwise comparisons were generated using the quicksort algorithm and presented to crowdworkers, who had to select the video inducing the calmest emotion or the most positive emotion.
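
To make the role of quicksort concrete, here is a minimal sketch (our illustration, not the actual crowdsourcing pipeline of [3]) in which every comparison made by the sorting algorithm is answered by a crowdworker judgment; the simulated answer function is an assumption.

    # Hypothetical sketch: ranking clips along the arousal axis with quicksort,
    # where each pairwise comparison is delegated to a (here simulated)
    # crowdworker choosing which clip induces the calmer emotion.
    import random

    def crowd_says_calmer(clip_a, clip_b):
        """Stand-in for a crowdsourced judgment: True if clip_a is judged calmer."""
        return clip_a["hidden_arousal"] < clip_b["hidden_arousal"]

    def quicksort_rank(clips):
        if len(clips) <= 1:
            return clips
        pivot, rest = clips[0], clips[1:]
        calmer, more_aroused = [], []
        for clip in rest:
            (calmer if crowd_says_calmer(clip, pivot) else more_aroused).append(clip)
        return quicksort_rank(calmer) + [pivot] + quicksort_rank(more_aroused)

    clips = [{"id": i, "hidden_arousal": random.random()} for i in range(20)]
    ranking = quicksort_rank(clips)  # index 0 = calmest, last = most aroused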
To cross-validate the annotations gathered from various uncontrolled environments using crowdsourcing, another experiment was designed to collect ratings for a subset of the database in a controlled environment. In this controlled experiment, 28 volunteers were asked to rate a carefully selected subset of the database using the 5-point discrete Self-Assessment Manikin scales for valence and arousal [4]. 20 regularly distributed excerpts per axis were selected, enough to represent the whole database while remaining few enough to keep the experiment at an acceptable duration.
From the original ranks and these ratings, absolute affective scores for valence and arousal have been estimated for each of the 9,800 video clips using Gaussian process regression models, as described in [1].
To obtain ground truth for the test subset, each of the 1,200 additional video clips has first been ranked with respect to the 9,800 video clips from the original dataset. Then, its valence and arousal ranks have been converted into valence and arousal scores using the regression models mentioned previously.
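
The exact rank-to-score estimation is given in [1]; the sketch below only illustrates the general idea of regressing an affective score onto a rank position with a Gaussian process, using scikit-learn and made-up rank/rating pairs.

    # Hypothetical illustration of converting crowdsourced ranks into absolute
    # affective scores with Gaussian process regression (see [1] for the
    # actual protocol); the rank/rating pairs below are made up.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    # Ranks of the excerpts rated in the controlled experiment and their mean
    # Self-Assessment Manikin valence ratings rescaled to [0, 1].
    rated_ranks = np.array([[120], [1500], [3300], [5000], [7200], [9100]])
    rated_scores = np.array([0.15, 0.30, 0.45, 0.55, 0.70, 0.85])

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=2000.0) + WhiteKernel(),
                                  normalize_y=True)
    gp.fit(rated_ranks, rated_scores)

    # Predict an absolute valence score for every rank in the database.
    all_ranks = np.arange(1, 9801).reshape(-1, 1)
    valence_scores = gp.predict(all_ranks)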

4.2 Ground Truth for the second subtask
In order to collect continuous valence and arousal annotations, 16 French participants had to continuously indicate their level of valence or arousal while watching the movies, using a modified version of the GTrace annotation tool [6] and a joystick (10 participants for the development set and 6 for the test set). The movies were divided into two subsets, and each annotator continuously annotated one subset along the induced valence axis and the other along the induced arousal axis. Thus, each movie has been continuously annotated by five annotators for the development set, and three for the test set.
Then, the continuous valence and arousal annotations from the participants have been down-sampled by averaging them over sliding windows of 10 seconds taken every second (i.e., 1 value per second) in order to remove the noise due to unintended moves of the joystick. Finally, these post-processed continuous annotations have been averaged in order to create a continuous mean signal of the valence and arousal self-assessments. The details of this processing are given in [2].
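
As an illustration only (not the organizers' exact code), the following sketch computes such a post-processing with NumPy: a 10-second moving average sampled once per second, followed by averaging across annotators; the raw traces and their 25 Hz sampling rate are assumptions for the example.

    # Hypothetical illustration of the annotation post-processing: 10-second
    # moving average sampled once per second, then averaged over annotators
    # to obtain a single mean valence (or arousal) signal per movie.
    import numpy as np

    def smooth_annotation(raw, fs=25, window_s=10):
        """raw: 1-D joystick trace sampled at fs Hz; returns one value per second."""
        win = window_s * fs
        n_seconds = (len(raw) - win) // fs + 1
        return np.array([raw[t * fs : t * fs + win].mean() for t in range(n_seconds)])

    # Raw traces of the five development-set annotators for one movie (made up here).
    rng = np.random.default_rng(0)
    annotators = [rng.uniform(-1.0, 1.0, size=25 * 600) for _ in range(5)]  # 10-minute movie

    smoothed = np.vstack([smooth_annotation(a) for a in annotators])
    mean_signal = smoothed.mean(axis=0)  # continuous mean self-assessment signal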

5. RUN DESCRIPTION
Participants can submit up to 5 runs for the first subtask (global emotion prediction). For the second subtask (continuous emotion prediction), there can be two types of run submissions: full runs, which concern the whole test set (the 10 movies, total duration: 11.48 hours), and light runs, which concern a subset of the test set (5 movies, total duration: 4.82 hours). In each case (light and full), up to 5 runs can be submitted. Moreover, each subtask has a required run in which no external training data may be used: only the provided development data is allowed, together with any features that can be automatically extracted from the video. Both subtasks also allow optional runs in which any external data, such as Internet sources, can be used, as long as they are marked as "external data" runs.

6. EVALUATION CRITERIA
Standard evaluation metrics (Mean Square Error and Pearson's correlation coefficient) are used to assess system performance. The measure generally used to evaluate regression models is the Mean Square Error (MSE). However, this measure is not always sufficient to analyze a model's efficiency, and the correlation may be required to obtain a deeper performance analysis. As an example, if a large portion of the data is neutral (i.e., its valence score is close to 0.5) or is distributed around the neutral score, a uniform model that always outputs 0.5 will obtain a good (low) MSE. In this case, the lack of accuracy of the model is revealed by the correlation between the predicted values and the ground truth, which will also be very low.
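
This point can be reproduced on synthetic scores (not task data): the sketch below compares a constant predictor with a noisy but trend-following predictor using scikit-learn's mean_squared_error and SciPy's pearsonr.

    # Synthetic illustration: a constant prediction reaches a low MSE on
    # near-neutral ground truth while its correlation with the truth is
    # undefined, whereas an imperfect but trend-following prediction scores
    # well on both metrics.
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    truth = 0.5 + 0.05 * rng.standard_normal(1000)             # near-neutral valence
    constant_pred = np.full_like(truth, 0.5)                   # uniform model
    tracking_pred = truth + 0.02 * rng.standard_normal(1000)   # imperfect but correlated

    for name, pred in [("constant", constant_pred), ("tracking", tracking_pred)]:
        mse = mean_squared_error(truth, pred)
        # Pearson's r is undefined for a zero-variance prediction, hence the guard.
        r = pearsonr(truth, pred)[0] if np.std(pred) > 0 else float("nan")
        print(f"{name}: MSE = {mse:.4f}, Pearson r = {r:.4f}")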

7. CONCLUSIONS
The Emotional Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for emotion detection in movies, in terms of valence and arousal scores. The LIRIS-ACCEDE dataset (http://liris-accede.ec-lyon.fr) has been used as the development set, and additional movies under Creative Commons licenses, together with their ground truth annotations, have been provided as the test set. Details on the methods and results of each individual team can be found in the papers of the participating teams in the MediaEval 2016 workshop proceedings.

8. ACKNOWLEDGMENTS
This task is supported by the CHIST-ERA Visen project ANR-12-CHRI-0002-04.
9.   REFERENCES
 [1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen.
     From crowdsourced rankings to affective ratings. In
     IEEE International Conference on Multimedia and
     Expo Workshops (ICMEW), 2014.
 [2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen.
     Deep learning vs. kernel methods: Performance for
     emotion prediction in videos. In Humaine Association
     Conference on Affective Computing and Intelligent
     Interaction (ACII), 2015.
 [3] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen.
     LIRIS-ACCEDE: A video database for affective content
     analysis. IEEE Transactions on Affective Computing,
     2015.
 [4] M. M. Bradley and P. J. Lang. Measuring emotion:
     the self-assessment manikin and the semantic
     differential. Journal of Behavior Therapy and
     Experimental Psychiatry, 1994.
 [5] L. Canini, S. Benini, and R. Leonardi. Affective
     recommendation of movies based on selected
     connotative features. IEEE Transactions on Circuits
     and Systems for Video Technology, 2013.
 [6] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich,
     C. Fyans, and P. Stapleton. GTrace: General trace
     program compatible with EmotionML. In Humaine
     Association Conference on Affective Computing and
     Intelligent Interaction (ACII), 2013.
 [7] H. Katti, K. Yadati, M. Kankanhalli, and T.-S. Chua.
     Affective video summarization and story board
     generation using pupillary dilation and eye gaze. In
     IEEE International Symposium on Multimedia (ISM),
     2011.
 [8] J. A. Russell. Core affect and the psychological
     construction of emotion. Psychological Review, 2003.
 [9] R. R. Shah, Y. Yu, and R. Zimmermann. ADVISOR:
     Personalized video soundtrack recommendation by
     late fusion with heuristic rankings. In ACM
     International Conference on Multimedia, 2014.
[10] M. Sjöberg, Y. Baveye, H. Wang, V. Quang,
     B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty,
     and L. Chen. The MediaEval 2015 Affective Impact of
     Movies task. In MediaEval 2015 Workshop, 2015.
[11] K. Yadati, H. Katti, and M. Kankanhalli. CAVVA:
     Computational affective video-in-video advertising.
     IEEE Transactions on Multimedia, 2014.
[12] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian.
     Affective visualization and retrieval for music video.
     IEEE Transactions on Multimedia, 2010.
[13] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu. Flexible
     presentation of videos based on affective content
     analysis. Advances in Multimedia Modeling, 2013.