              The MediaEval 2017 Emotional Impact of Movies Task
        Emmanuel Dellandréa1 , Martijn Huigsloot2 , Liming Chen1 , Yoann Baveye3 and Mats Sjöberg4
                               1 Ecole Centrale de Lyon, France, {emmanuel.dellandrea, liming.chen}@ec-lyon.fr
                                                      2 NICAM, Netherlands, Huigsloot@nicam.nl
                                             3 Université de Nantes, France, yoann.baveye@univ-nantes.fr
                                           4 HIIT, University of Helsinki, Finland, mats.sjoberg@helsinki.fi


Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper provides a description of the MediaEval 2017 “Emotional Impact of Movies” task, which continues to build on previous years’ editions. In this year’s task, participants are expected to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence, arousal and fear. Here we provide a description of the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.

1 INTRODUCTION
Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including mood-based personalized content recommendation [3], video indexing [13], and efficient movie visualization and browsing [14]. Beyond the analysis of existing video material, affective computing techniques can also be used to generate new content, e.g., movie summarization [9] or personalized soundtrack recommendation to make user-generated videos more attractive [11]. Affective techniques can also be used to enhance user engagement with advertising content by optimizing the way ads are inserted inside videos [12].

While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with the overall goal of endowing computers with human-like perception capabilities. Thus, this task is proposed to offer researchers a place to compare their approaches for predicting the emotional impact of movies. It continues to build on previous years’ editions [5] with a first subtask, which is a mix of last year’s tasks related to valence and arousal prediction, and a new subtask dedicated to fear prediction.

2 TASK DESCRIPTION
The task requires participants to deploy multimedia features and models to automatically predict the emotional impact of movies. The emotional impact considered here is the expected emotion, i.e., the emotion that the majority of the audience feels in response to the same content. In other words, the expected emotion is the expected value of experienced (i.e., induced) emotion in a population. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [8].

This year, two new scenarios are proposed as subtasks. In both cases, long movies are considered and the emotional impact has to be predicted for consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds:

(1) Valence/Arousal prediction: participants’ systems are supposed to predict a score of expected valence and arousal for each consecutive 10-second segment. Valence is defined as a continuous scale from most negative to most positive emotions, while arousal is defined continuously from calmest to most active emotions [10];
(2) Fear prediction: the purpose here is to predict, for each consecutive 10-second segment, whether it is likely to induce fear or not. The targeted use case is the prediction of frightening scenes to help systems protect children from potentially harmful video content. This subtask is complementary to the valence/arousal prediction subtask in the sense that discrete emotions often overlap when mapped into the 2D valence/arousal space (for instance, fear, disgust and anger overlap since they are all characterized by very negative valence and high arousal) [7].
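
To make the segmentation scheme concrete, the following minimal Python sketch enumerates the segment boundaries used throughout the task. It is an illustration only: the function name, the example duration and the handling of a trailing partial segment (dropped here) are our own assumptions, not part of the official task materials.

```python
# Sketch: enumerate the consecutive 10-second segments (5-second shift)
# over which predictions are expected. Illustration only; how a trailing
# partial segment is handled is an assumption (it is dropped here).
def task_segments(duration_sec, length=10.0, shift=5.0):
    """Yield (start, end) times, in seconds, of the sliding segments."""
    start = 0.0
    while start + length <= duration_sec:
        yield (start, start + length)
        start += shift

# Hypothetical example: a 117-second movie yields the segments
# (0, 10), (5, 15), ..., (105, 115).
print(list(task_segments(117)))
```
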
3 DATA DESCRIPTION
The dataset used in this task is the LIRIS-ACCEDE dataset (http://liris-accede.ec-lyon.fr). It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution [2]. Several movie genres are represented in this collection, such as horror, comedy, drama and action. The languages are mainly English, with a small set of Italian, Spanish, French and other movies subtitled in English.

The continuous part of LIRIS-ACCEDE [1] is used as the development set for both subtasks. It consists of a selection of 30 movies, between 117 and 4,566 seconds long (mean = 884.2 s, SD = 766.7 s). The total length of the 30 selected movies is 7 hours, 22 minutes and 5 seconds.

The test set consists of a selection of 14 movies that are not part of the 160 original movies. They are between 210 and 6,260 seconds long (mean = 2,045.2 s, SD = 2,450.1 s). The total length of the 14 selected movies is 7 hours, 57 minutes and 13 seconds.


In addition to the video data, participants are also provided with general-purpose audio and visual content features. To compute the audio features, each movie has first been processed to extract consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds. Audio features have then been extracted from these segments using the openSMILE toolbox (http://audeering.com/technology/opensmile/) [6], with the default configuration named “emobase2010.conf”. This configuration computes 1,582 features, which result from a base of 34 low-level descriptors (LLD) with 34 corresponding delta coefficients appended, and 21 functionals applied to each of these 68 LLD contours (1,428 features). In addition, 19 functionals are applied to the 4 pitch-based LLD and their 4 delta coefficient contours (152 features). Finally, the number of pitch onsets (pseudo syllables) and the total duration of the input are appended (2 features).
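
As an illustration of how such features can be obtained, the sketch below calls openSMILE’s SMILExtract command-line tool on the audio of one pre-cut 10-second segment. The file names and configuration path are hypothetical, and this is not necessarily the organizers’ exact extraction pipeline.

```python
# Sketch: extract the emobase2010 feature set (1,582 features) for one
# pre-cut 10-second audio segment via the openSMILE command-line tool.
# Assumes SMILExtract is on the PATH; paths and file names are hypothetical.
import subprocess

def extract_emobase2010(wav_path, out_path, config="config/emobase2010.conf"):
    # SMILExtract -C <config> -I <input wav> -O <output file>
    subprocess.run(["SMILExtract", "-C", config, "-I", wav_path, "-O", out_path],
                   check=True)

extract_emobase2010("movie01_seg0000.wav", "movie01_seg0000.arff")
```
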
Beyond audio features, image frames were extracted from each movie every second. For each of these images, several general-purpose visual features have been provided. They have been computed using the LIRE library (http://www.lire-project.net/), except for the CNN features (VGG16 fc6 layer), which have been extracted using the MATLAB Neural Network Toolbox (https://www.mathworks.com/products/neural-network.html). The visual features are the following: Auto Color Correlogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, Joint descriptor combining CEDD and FCTH in one histogram, Scalable Color, Tamura, Local Binary Patterns, and VGG16 fc6 layer.
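
The organizers extracted the VGG16 fc6 activations with the MATLAB Neural Network Toolbox. As a rough, hedged equivalent, the sketch below computes fc6 features for one extracted frame with torchvision’s pretrained VGG16; the weights (and therefore the feature values) differ from the MATLAB model, and the frame file name is hypothetical.

```python
# Sketch: 4,096-dimensional VGG16 fc6 features for one extracted frame,
# using torchvision's pretrained VGG16 as a stand-in for the MATLAB
# toolbox used by the organizers (the resulting values will differ).
import torch
from torchvision import models, transforms
from PIL import Image

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# Keep everything up to and including the first fully connected layer (fc6).
fc6 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                          vgg.classifier[0])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("frame_0001.jpg").convert("RGB")  # hypothetical frame file
with torch.no_grad():
    features = fc6(preprocess(img).unsqueeze(0))   # shape: (1, 4096)
```
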
4 GROUND TRUTH
Annotations are provided to participants for the 30 movies of the development set. For each movie, a first file contains valence and arousal values for the consecutive 10-second segments sliding over the whole movie with a shift of 5 seconds, and a second file indicates whether these segments are supposed to induce fear (value 1) or not (value 0).

4.1 Ground Truth for the first subtask
In order to collect continuous valence and arousal annotations, 16 French participants had to continuously indicate their level of induced valence or arousal while watching the movies, using a modified version of the GTrace annotation tool [4] and a joystick (10 participants for the development set and 6 for the test set). The movies were divided into two subsets: each annotator continuously annotated one subset considering the induced valence and the other subset considering the induced arousal. Thus, each movie has been continuously annotated by five annotators for the development set, and by three for the test set.

Then, the continuous valence and arousal annotations from the participants have been down-sampled by averaging them over windows of 10 seconds with a shift of 1 second (i.e., 1 value per second) in order to remove the noise due to unintended movements of the joystick. These post-processed continuous annotations have then been averaged across annotators in order to create a continuous mean signal of the valence and arousal self-assessments. The details of this processing are given in [1]. Finally, for the purpose of the first subtask, these values have been averaged to obtain a single value of valence and a single value of arousal for every consecutive 10-second segment sliding over the whole movie with a shift of 5 seconds.
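
A minimal numpy sketch of this post-processing chain is given below. It assumes per-second raw traces in [0, 1] and approximates, but does not reproduce, the exact resampling described in [1]; the trace lengths and random values are hypothetical.

```python
# Sketch: turn raw per-second annotation traces into per-segment ground
# truth. Approximation of the pipeline described in [1]; not official code.
import numpy as np

def smooth_trace(trace, window=10):
    """Average over 10 s windows with a 1 s shift (joystick noise removal)."""
    return np.array([trace[t:t + window].mean()
                     for t in range(len(trace) - window + 1)])

def per_segment_values(mean_signal, length=10, shift=5):
    """One value per consecutive 10 s segment sliding with a 5 s shift."""
    return np.array([mean_signal[t:t + length].mean()
                     for t in range(0, len(mean_signal) - length + 1, shift)])

# Hypothetical raw data: 5 annotators, one value per second, 884 s movie.
raw_traces = [np.random.rand(884) for _ in range(5)]
smoothed = np.stack([smooth_trace(t) for t in raw_traces])
mean_signal = smoothed.mean(axis=0)             # continuous mean self-assessment
ground_truth = per_segment_values(mean_signal)  # one value per 10 s segment
```
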
4.2 Ground Truth for the second subtask
Fear annotations for the second subtask were generated using a tool specifically designed for the classification of audio-visual media, which allows annotations to be made while watching the movie. The annotations were produced by two experienced team members of NICAM (http://www.kijkwijzer.nl/nicam), both trained in the classification of media. Each movie was annotated by a single annotator, who reported the start and stop times of each sequence in the movie expected to induce fear. From this information, the 10-second segments sliding over the whole movie with a shift of 5 seconds have been labeled as fear (value 1) if they intersect one of the fear sequences, and as not fear (value 0) otherwise.
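
A hedged sketch of this labeling rule (a simple interval-intersection test) follows; the annotated fear intervals and the movie duration are hypothetical.

```python
# Sketch: label each 10-second segment (5-second shift) as fear (1) if it
# overlaps any annotated fear sequence, and as not fear (0) otherwise.
def fear_labels(duration_sec, fear_sequences, length=10.0, shift=5.0):
    labels = []
    start = 0.0
    while start + length <= duration_sec:
        end = start + length
        overlaps = any(start < seq_end and end > seq_start
                       for seq_start, seq_end in fear_sequences)
        labels.append(1 if overlaps else 0)
        start += shift
    return labels

# Hypothetical annotation: fear reported from 42-55 s and 300-320 s
# of a 600-second movie.
print(fear_labels(600, [(42, 55), (300, 320)]))
```
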
5 RUN DESCRIPTION
Participants can submit up to 5 runs for each of the two subtasks, i.e., 10 runs in total. Models can rely on the features provided by the organizers or on any other external data.

6 EVALUATION CRITERIA
Standard evaluation metrics are used to assess system performance. The first subtask can be considered a regression problem (estimation of the expected valence and arousal scores), while the second subtask can be seen as a binary classification problem (a video segment is supposed to induce/not induce fear).

For the first subtask, the official metric is the Mean Square Error (MSE), the measure commonly used to evaluate regression models. However, to allow a deeper understanding of systems’ performance, we also consider Pearson’s correlation coefficient. Indeed, MSE alone is not always sufficient to analyze a model’s quality, and the correlation may be required for a deeper performance analysis. For example, if a large portion of the data is neutral (i.e., its valence score is close to 0.5) or is distributed around the neutral score, a uniform model that always outputs 0.5 will achieve a good (low) MSE. In this case, the lack of accuracy of the model will be brought to the fore by the correlation between the predicted values and the ground truth, which will also be very low.

For the second subtask, the official metric is the Mean Average Precision (MAP). Moreover, Accuracy, Precision, Recall and F1-score are also considered to provide insights into systems’ behaviour.
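
For reference, metrics of this kind can be computed with scipy and scikit-learn as in the sketch below. The predictions shown are hypothetical, and the organizers’ own evaluation scripts may handle details (such as per-movie averaging for MAP) differently.

```python
# Sketch: official and secondary metrics for both subtasks, computed with
# scipy / scikit-learn on hypothetical per-segment predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, mean_squared_error, precision_score,
                             recall_score)

# Subtask 1: regression (valence or arousal score per segment).
y_true = np.array([0.45, 0.50, 0.62, 0.38])
y_pred = np.array([0.48, 0.52, 0.55, 0.41])
mse = mean_squared_error(y_true, y_pred)    # official metric
r, _ = pearsonr(y_true, y_pred)             # complementary metric

# Subtask 2: binary classification (fear / not fear per segment).
fear_true = np.array([0, 0, 1, 1])
fear_scores = np.array([0.1, 0.4, 0.8, 0.3])   # predicted fear confidence
fear_pred = (fear_scores >= 0.5).astype(int)   # hypothetical threshold
# average_precision_score gives AP for one movie's segments; the task's MAP
# would typically average this value over the test movies (an assumption).
ap = average_precision_score(fear_true, fear_scores)
acc = accuracy_score(fear_true, fear_pred)
prec = precision_score(fear_true, fear_pred)
rec = recall_score(fear_true, fear_pred)
f1 = f1_score(fear_true, fear_pred)
print(mse, r, ap, acc, prec, rec, f1)
```
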
7 CONCLUSIONS
The Emotional Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for emotion detection in movies, in terms of valence, arousal and fear. The LIRIS-ACCEDE dataset is used for the development and test sets. Details on the methods and results of each individual team can be found in the papers of the participating teams in the MediaEval 2017 workshop proceedings.

ACKNOWLEDGMENTS
This task is supported by the CHIST-ERA Visen project ANR-12-CHRI-0002-04.


REFERENCES
 [1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. Deep
     Learning vs. Kernel Methods: Performance for Emotion Prediction
     in Videos. In Humaine Association Conference on Affective Computing
     and Intelligent Interaction (ACII).
 [2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. LIRIS-
     ACCEDE: A Video Database for Affective Content Analysis. IEEE
     Transactions on Affective Computing 6, 1 (2015), 43–55.
 [3] L. Canini, S. Benini, and R. Leonardi. 2013. Affective recommendation
     of movies based on selected connotative features. IEEE Transactions
     on Circuits and Systems for Video Technology 23, 4 (2013), 636–647.
 [4] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton.
      2013. GTrace: General trace program compatible with EmotionML. In
     Humaine Association Conference on Affective Computing and Intelligent
     Interaction (ACII).
 [5] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and C. Chamaret. 2016.
     The MediaEval 2016 Emotional Impact of Movies Task. In MediaEval
     2016 Workshop.
 [6] F. Eyben, F. Weninger, F. Gross, and B. Schuller. 2013. Recent Develop-
     ments in openSMILE, the Munich Open-Source Multimedia Feature
     Extractor. In ACM Multimedia (MM), Barcelona, Spain.
 [7] L.-A. Feldman. 1995. Valence focus and arousal focus: Individual
      differences in the structure of affective experience. Journal of
      Personality and Social Psychology 69 (1995), 153–166.
 [8] A. Hanjalic. 2006. Extracting moods from pictures and sounds: To-
     wards truly personalized TV. IEEE Signal Processing Magazine (2006).
 [9] H. Katti, K. Yadati, M. Kankanhalli, and T.-S. Chua. 2011. Affective
     video summarization and story board generation using pupillary di-
     lation and eye gaze. In IEEE International Symposium on Multimedia
     (ISM).
[10] J. A. Russell. 2003. Core affect and the psychological construction of
     emotion. Psychological Review (2003).
[11] R. R. Shah, Y. Yu, and R. Zimmermann. 2014. Advisor: Personalized
     video soundtrack recommendation by late fusion with heuristic rank-
     ings. In ACM International Conference on Multimedia.
[12] K. Yadati, H. Katti, and M. Kankanhalli. 2014. Cavva: Computational
     affective video-in-video advertising. IEEE Transactions on Multimedia
     16, 1 (2014), 15–23.
[13] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. 2010. Affective
     visualization and retrieval for music video. IEEE Transactions on
     Multimedia 12, 6 (2010), 510–522.
[14] S. Zhao, H. Yao, X. Sun, X. Jiang, and P. Xu. 2013. Flexible presentation
     of videos based on affective content analysis. Advances in Multimedia
     Modeling 7732 (2013).