The MediaEval 2015 Affective Impact of Movies Task

Mats Sjöberg1, Yoann Baveye2, Hanli Wang3, Vu Lam Quang4, Bogdan Ionescu5, Emmanuel Dellandréa6, Markus Schedl7, Claire-Hélène Demarty2, and Liming Chen6
1 Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland, mats.sjoberg@helsinki.fi
2 Technicolor, France, [yoann.baveye,claire-helene.demarty]@technicolor.com
3 Tongji University, China, hanliwang@tongji.edu.cn
4 University of Science, VNU-HCMC, Vietnam, lamquangvu@gmail.com
5 University Politehnica of Bucharest, Romania, bionescu@imag.pub.ro
6 Ecole Centrale de Lyon, France, emmanuel.dellandrea@ec-lyon.fr, liming.chen@liris.cnrs.fr
7 Johannes Kepler University, Linz, Austria, markus.schedl@jku.at

ABSTRACT
This paper provides a description of the MediaEval 2015 "Affective Impact of Movies Task", which is running for the fifth year, previously under the name "Violent Scenes Detection". In this year's task, participants are expected to create systems that automatically detect video content that depicts violence, or predict the affective impact that video content will have on viewers. Here we provide insights on the use case, task challenges, data set and ground truth, task run requirements and evaluation metrics.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1. INTRODUCTION
The Affective Impact of Movies Task is part of the MediaEval 2015 Benchmarking Initiative. The overall use case scenario of the task is to design a video search system that uses automatic tools to help users find videos that fit their particular mood, age or preferences. To address this, we present two subtasks:
• Induced affect detection: the emotional impact of a video or movie can be a strong indicator for search or recommendation;
• Violence detection: detecting violent content is an important aspect of filtering video content based on age.
This task builds on the experiences from previous years' editions of the Affect in Multimedia Task: Violent Scenes Detection. However, this year we introduce a completely new subtask for detecting the emotional impact of movies. In addition, we are introducing to MediaEval a newly extended data set consisting of 10,900 short video clips extracted from 199 Creative Commons-licensed movies.
In the literature, detection of violence in movies has been addressed only marginally until recently [8, 6, 1]. Similarly, in affective video content analysis it has been repeatedly claimed that the field would greatly benefit from a standardised evaluation data set [5, 9]. Most of the previously proposed methods for affective impact or violence detection suffer from a lack of consistent evaluation, which usually requires the use of a constrained and closed data set [4, 7, 10]. Hence, the task's main objective is to propose a public common evaluation framework for research in these closely related areas.
2. TASK DESCRIPTION
The task requires participants to deploy multimedia features to automatically detect violent content and the emotional impact of short movie clips. In contrast to previous years, the task no longer considers arbitrary starting and ending points of detected segments; instead, the short video clips are treated as single units for detection purposes, with a single judgement per clip. This year there are two subtasks: (i) induced affect detection, and (ii) violence detection. Both subtasks use the same videos for training and testing.
For the induced affect detection subtask, participants are expected to predict, for each video, its valence class (i.e., one of negative, neutral or positive) and its arousal class (i.e., one of calm, neutral or active). In this subtask, we focus on felt emotion, i.e., the actual emotion of the viewer when watching the video clip, rather than, for example, what the viewer believes he or she is expected to feel. Valence is defined as a continuous scale from most negative to most positive emotion, while arousal is defined continuously from most calm to most active emotion. However, to keep the two subtasks compatible and to enable participants to use similar systems for both, we have opted to discretise the two scales into three classes as follows:
• valence: negative, neutral, and positive,
• arousal: calm, neutral, and active.
For the violence detection subtask, participants are expected to classify each video as violent or non-violent. Violence is defined as content that "one would not let an 8-year-old child see in a movie because it contains physical violence".
To solve the task, participants are only allowed to use features extracted from the original video files, or metadata provided by the organisers. In addition, external data may be used for runs that are specifically marked as such; however, at least one run for each subtask must be submitted without any external data.

3. DATA DESCRIPTION
This year a single data set is provided: 10,900 short video clips extracted from 199 Creative Commons-licensed movies of various genres. The movies are split into a development set, intended for training and validation, and a test set of 100 and 99 movies respectively, resulting in 6,144 and 4,756 extracted short video clips.
The proposed data set is an extension of the LIRIS-ACCEDE data set, originally composed of 9,800 excerpts extracted from 160 movies [3]. For this task, 1,100 additional video clips have been extracted from 39 new movies and included in the test set. The selected feature films and short films range from professionally made to amateur movies, but almost all are indexed on video platforms referencing the best free-to-share movies or have been screened at film festivals. Since these movies are shared under Creative Commons licenses, the excerpts can also be shared and downloaded along with the annotations without infringing copyright. The excerpts have been extracted from the movies so that they last between 8 and 12 seconds and start and end with a cut or a fade.
Along with the video material and the annotations, features extracted from each video clip are also provided by the organisers. They correspond to the audiovisual features described in [3].

4. GROUND TRUTH
For each of the 10,900 video clips, the ground truth consists of: a binary value indicating the presence of violence, the class of the excerpt for felt arousal (calm-neutral-active), and the class for felt valence (negative-neutral-positive). Before the evaluation, participants are provided only with the annotations for the development set, while those for the test set are held back to be used for benchmarking the submitted results.
The original video clips included in the LIRIS-ACCEDE data set were already ranked along the felt valence and arousal axes using a crowdsourcing protocol [3]. Pairwise comparisons were generated using the quicksort algorithm and presented to crowdworkers, who had to select the video inducing the calmer emotion or the more positive emotion. In [2] the crowdsourced ranks were converted into absolute affective scores ranging from -1 to 1, which have been used to define the three classes for each affective axis for the MediaEval task. The negative and calm classes correspond respectively to the video clips with a valence or arousal score smaller than -0.15, the neutral class for both axes is assigned to the videos with an affective score between -0.15 and 0.15, and the positive and active classes are assigned to the videos with an affective score higher than 0.15. These limits have been defined empirically, taking into account the distribution of the data set in the valence-arousal space.
For the 2015 MediaEval evaluation the test set was extended with an additional 1,100 video clips. Due to time and resource constraints, these were annotated using a simplified scheme which takes advantage of the fact that we do not need a full ranking of the new video clips, but only to separate them into three classes for each affect axis. Two pivot videos were selected for each axis, with absolute scores very close to the -0.15 and 0.15 class boundaries. The annotation task could then be formulated as comparing each video clip to these pivot videos, and thus placing it in its correct class. In total, 17 annotators from five different countries were involved, and three judgements were collected for each pivot/affect dimension pair. Out of these three judgements the majority vote was selected.
For violence detection, the annotation process was similar to previous years' protocol. Firstly, all the videos were annotated separately by two groups of annotators from two different countries. In each group, regular annotators labelled all the videos, which were then reviewed by master annotators. Regular annotators were graduate students (typically single with no children) and master annotators were senior researchers (typically married with children). No discussions were held between annotators during the annotation process. Group 1 used 12 regular and 2 master annotators, while Group 2 used 5 regular and 2 master annotators. Within each group, each video received 2 different annotations, which were then merged by the master annotators into the final annotation for the group. Finally, the resulting annotations from the two groups were merged and reviewed once more by the task organisers.
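To make the class definitions concrete, the following minimal Python sketch (not part of the official task tools; function and variable names are illustrative) shows how an affective score in [-1, 1] would be mapped to the three discrete classes using the -0.15 and 0.15 boundaries, and how three pivot-comparison judgements could be reduced to a single label by majority vote.

from collections import Counter

def score_to_class(score, axis="valence"):
    # Map an affective score in [-1, 1] to one of the three task classes
    # using the empirically chosen class boundaries at -0.15 and 0.15.
    low, high = ("negative", "positive") if axis == "valence" else ("calm", "active")
    if score < -0.15:
        return low
    if score > 0.15:
        return high
    return "neutral"

def majority_vote(judgements):
    # Reduce the three judgements collected per clip and affect dimension
    # to a single label by majority vote.
    return Counter(judgements).most_common(1)[0][0]

# Example: score_to_class(-0.4, "arousal") returns "calm";
# majority_vote(["neutral", "positive", "positive"]) returns "positive".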
5. RUN DESCRIPTION
Participants can submit up to 5 runs for each subtask: induced affect detection and violence detection. Each subtask has a required run in which no external training data may be used; only the provided development data is allowed, together with any features that can be automatically extracted from the video. Both subtasks also allow optional runs in which any external data, such as Internet sources, can be used, as long as they are marked as "external data" runs.

6. EVALUATION CRITERIA
For the induced affect detection subtask the official evaluation measure is global accuracy, calculated separately for the valence and arousal dimensions. Global accuracy is the proportion of the returned video clips that have been assigned to the correct class (out of the three classes).
The official evaluation metric for the violence detection subtask is average precision, which is calculated using the trec_eval tool provided by NIST (http://trec.nist.gov/trec_eval/). This tool also produces a set of commonly used metrics such as precision and recall, which may be used for comparison purposes.
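As a rough illustration of the two measures (the official violence scores are produced by NIST's trec_eval tool, so the sketch below only shows what is being evaluated; names are illustrative, not part of the task tools):

def global_accuracy(predicted, ground_truth):
    # Proportion of video clips assigned to the correct class
    # (computed separately for valence and for arousal).
    correct = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return correct / len(ground_truth)

def average_precision(ranked_clips, violent_clips):
    # Non-interpolated average precision over a ranked list of clips,
    # given the set of clips annotated as violent.
    hits, precisions = 0, []
    for rank, clip in enumerate(ranked_clips, start=1):
        if clip in violent_clips:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(violent_clips) if violent_clips else 0.0

# Example: global_accuracy(["calm", "active"], ["calm", "neutral"]) returns 0.5;
# average_precision(["c3", "c1", "c2"], {"c1", "c2"}) returns (1/2 + 2/3) / 2.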
7. CONCLUSIONS
The Affective Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for violence and emotion detection in movies. The introduction of the induced affect detection subtask is a new effort for this year. In addition, we have started fresh with a data set not used in MediaEval before, consisting of short Creative Commons-licensed video clips, which enables sharing the data legally and directly with participants. Details on the methods and results of each individual team can be found in the papers of the participating teams in these proceedings.

Acknowledgments
This task is supported by the following projects: ERA-NET CHIST-ERA grant ANR-12-CHRI-0002-04, UEFISCDI SCOUTER grant 28DPST/30-08-2013, Vietnam National University Ho Chi Minh City grant B2013-26-01, Austrian Science Fund P25655, and EU FP7-ICT-2011-9 project 601166.

8. REFERENCES
[1] E. Acar, F. Hopfgartner, and S. Albayrak. Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues. In Proceedings of the 21st ACM International Conference on Multimedia, pages 717–720. ACM, 2013.
[2] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. From crowdsourced rankings to affective ratings. In 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pages 1–6, July 2014.
[3] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, Jan. 2015.
[4] A. Hanjalic and L.-Q. Xu. Affective video content representation and modeling. IEEE Transactions on Multimedia, 7(1):143–154, Feb. 2005.
[5] M. Horvat, S. Popovic, and K. Cosic. Multimedia stimuli databases usage patterns: a survey report. In Proceedings of the 36th International ICT Convention MIPRO, pages 993–997, 2013.
[6] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In ICMR, pages 215–222, 2013.
[7] G. Irie, T. Satou, A. Kojima, T. Yamasaki, and K. Aizawa. Affective audio-visual words and latent topic driving model for realizing movie affective scene classification. IEEE Transactions on Multimedia, 12(6):523–535, Oct. 2010.
[8] C. Penet, C.-H. Demarty, G. Gravier, and P. Gros. Multimodal information fusion and temporal integration for violence detection in movies. In ICASSP, Kyoto, Japan, 2012.
[9] M. Soleymani, M. Larson, T. Pun, and A. Hanjalic. Corpus development for affective video indexing. IEEE Transactions on Multimedia, 16(4):1075–1089, June 2014.
[10] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia, 12(6):510–522, Oct. 2010.