The MediaEval 2013 Affect Task: Violent Scenes Detection*

Claire-Hélène Demarty, Technicolor, Rennes, France (claire-helene.demarty@technicolor.com)
Cédric Penet, Technicolor, Rennes, France (cedric.penet@technicolor.com)
Markus Schedl, Johannes Kepler University, Linz, Austria (markus.schedl@jku.at)
Bogdan Ionescu, University Politehnica of Bucharest, Romania (bionescu@imag.pub.ro)
Vu Lam Quang, University of Science, VNU-HCMC, Vietnam (lamquangvu@gmail.com)
Yu-Gang Jiang, Fudan University, China (yugang.jiang@gmail.com)

* This year, work has been supported, in part, by the Quaero Program (http://www.quaero.org/).

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 17-19, 2013, Barcelona, Spain.

ABSTRACT
This paper provides a description of the MediaEval 2013 Affect Task: Violent Scenes Detection. The task, proposed for the third year to the research community, derives directly from a Technicolor use case which aims at easing a user's selection process from a movie database; it therefore applies to movie content. We provide some insight into the Technicolor use case before giving details on the task itself, which has seen some changes in 2013. The dataset, annotations, and evaluation criteria, as well as the required and optional runs, are described.

1. INTRODUCTION
The Affect Task: Violent Scenes Detection is part of the MediaEval 2013 benchmarking initiative for multimedia evaluation. The objective is to automatically detect violent segments in movies. The challenge is proposed for the third year in the MediaEval benchmark. It derives from a use case at Technicolor (http://www.technicolor.com), which involves helping users choose movies that are suitable for the children in their family. The movies should be suitable in terms of their violent content, e.g., for viewing by the users' families. Users select or reject movies by previewing the parts of the movies (i.e., scenes or segments) that include the most violent moments. In the literature, the detection of violence had received little attention [2, 1, 3] until it recently gained interest. As most of the proposed methods suffer from the lack of a common and consistent database and usually use a limited development set, the task was launched to propose a public and common framework for the research community. This year, among other changes, two definitions of violence are studied, an objective one and a subjective one (see below). The addition of a subjective definition was motivated by the fact that the 2012 definition proved to lead to annotations which do not correspond to the use case.

2. TASK DESCRIPTION
The task requires participants to deploy multimodal features to automatically detect portions of movies containing violent material. For 2013, two definitions of violence are studied. The first one corresponds to the one used in previous years and was chosen to be as objective as possible: violence is defined as "physical violence or accident resulting in human injury or pain". In an attempt to better fit the use case, a second definition is proposed, according to which the events of interest are "those which one would not let an 8-year-old child see, because they contain physical violence". This year, contrary to the previous challenges, the different runs alternatively allow participants to use either only features extracted from the provided DVDs, or also additional external data (e.g., extracted from the web).
3. DATA DESCRIPTION
In line with the use case, the corpus developed for the task is a set of 25 Hollywood movies that must be purchased as DVDs by the participants. The movies are of different genres and show different amounts of violence (from extremely violent movies to movies without violence). The content extractable from the DVDs consists of information from different modalities, namely visual information, audio signals and subtitles, as well as any additional metadata present on the DVDs. Of these 25 movies, 18 are dedicated to the training process: Armageddon, Billy Elliot, Eragon, Harry Potter 5, I am Legend, Leon, Midnight Express, Pirates of the Caribbean 1, Reservoir Dogs, Saving Private Ryan, The Sixth Sense, The Wicker Man, Kill Bill 1, The Bourne Identity, The Wizard of Oz, Dead Poets Society, Fight Club and Independence Day. The remaining 7 movies, Fantastic Four, Fargo, Forrest Gump, Legally Blonde, Pulp Fiction, The Godfather 1 and The Pianist, will serve as the evaluation set. As in 2011 and 2012, we tried to preserve the distribution of genres (from extremely violent to non-violent) in both the training and evaluation sets.

4. GROUND TRUTH
The ground truth¹ was created by several human assessors. In addition to segments containing physical violence (according to the two definitions above), the annotations also include high-level concepts for the visual and audio modalities. Each annotated violent segment contains only one action, whenever possible. In cases where different actions overlap, the whole segment is provided with the different actions; this is indicated in the annotation files by the tag "multiple action scene". Each violent segment is annotated at frame level, i.e., it is defined by its starting and ending video frame numbers.

¹ The annotations, shot detections and key frames for this task were made available by Fudan University, the Vietnam University of Science, and Technicolor. Any publication using these data should acknowledge these institutions' contributions.

Seven visual and three audio concepts are provided: presence of blood, fights, presence of fire, presence of guns, presence of cold weapons, car chases and gory scenes (for the video modality); presence of screams, gunshots and explosions (for the audio modality). Participants should note that they are welcome to carry out detection of these high-level concepts themselves. However, concept detection is not the goal of the task, and the high-level concept annotations are only provided for training purposes and only on the training set. Each video concept follows the same annotation format as the violent segments, i.e., starting and ending frame numbers and possibly some additional tags. Regarding blood annotations, the proportion of blood in each segment is indicated by one of the following tags: unnoticeable, low, medium and high. Four different types of fights are annotated: only two people fighting, a small group of people (roughly fewer than 10), a large group of people (more than 10), and distant attack (i.e., no real fight, but somebody is shot or attacked at a distance). As for the presence of fire, anything from big fires and explosions to fire coming out of a gun while shooting, a candle, a cigarette lighter, a cigarette, or sparks was annotated; e.g., a space shuttle taking off also generates fire and thus receives a fire label. An additional tag may indicate special colors of the fire (i.e., not yellow or orange). If a segment of video showed the presence of firearms (or cold weapons), it was annotated, for any type of (parts of) guns (or cold weapons) or assimilated arms. By "cold weapon", we mean any weapon that does not involve fire or explosions resulting from the use of gunpowder or other explosive materials. Annotations of gory scenes are more delicate. In the present task, they are indicated by graphic images of bloodletting and/or tissue damage, which includes horror or war representations. As this is also a subjective and difficult notion to define, some additional segments showing particularly disgusting mutants or creatures are also annotated as gore; in this case, additional tags describing the event/scene are added. For the audio concepts, each temporal segment is annotated with its starting and ending times in seconds, and with an additional tag corresponding to the type of event, chosen from the list: nothing, gunshot, canon fire, scream, scream effort, explosion, multiple actions, multiple actions canon fire, multiple actions scream effort. Automatically generated shot boundaries with their corresponding key frames are also provided with each movie. Shot segmentation was carried out by Technicolor's software.
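To make the annotation format above concrete, the following is a minimal sketch of how a participant might load frame-level violent-segment annotations. The exact file layout distributed with the task is not specified in this paper, so the assumed one-segment-per-line format ("<start_frame> <end_frame> [tags...]") and the helper names (ViolentSegment, load_segments, frame_is_violent) are illustrative assumptions, not the official tools.

```python
# Minimal sketch of reading frame-level violent-segment annotations.
# The column layout assumed here (start frame, end frame, optional tags)
# is an illustration only; consult the files distributed by the organizers.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ViolentSegment:
    start_frame: int                                  # first frame of the segment
    end_frame: int                                    # last frame of the segment
    tags: List[str] = field(default_factory=list)     # e.g. "multiple action scene"


def load_segments(path: str) -> List[ViolentSegment]:
    """Parse a (hypothetical) annotation file into ViolentSegment objects."""
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            start, end = int(parts[0]), int(parts[1])
            segments.append(ViolentSegment(start, end, parts[2:]))
    return segments


def frame_is_violent(frame: int, segments: List[ViolentSegment]) -> bool:
    """True if a frame falls inside any annotated violent segment."""
    return any(s.start_frame <= frame <= s.end_frame for s in segments)
```

Audio-concept annotations could be handled analogously, with starting and ending times in seconds and a single event-type tag in place of frame numbers.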
5. RUN DESCRIPTION
Participants can submit four types of runs: two of them are shot-classification runs and the other two are segment-level runs. For the two shot-classification runs, participants are required to provide violent-scene detection at the shot level, according to the provided shot boundaries. Each shot has to be classified as violent or non-violent, with a confidence score. These two runs differ in the data that can be used for the classification: in the first one, only the content of the movie extractable from the DVDs is allowed for feature extraction, whereas in the second one, additional external data (e.g., extracted from the web) can also be used. For the two segment-level runs, participants are required to provide violent segments for each test movie, independently of the shot boundaries. Once again, confidence scores should be added for each segment. Similarly to the shot-level runs, the two segment-level runs differ in the type of data allowed for the classification: internal data from the DVDs only vs. internal data plus additional external data. In all cases, confidence scores are compulsory, as they will be used for the evaluation metric. They will also allow detection error trade-off curves to be plotted, which should be of great interest for analyzing and comparing the different techniques. For both subtasks, i.e., both violence definitions, the required run is the shot-level run without external data.

As a first step towards a qualitative evaluation, participants are encouraged to present at the MediaEval workshop a video summary of the most violent scenes found by their algorithms. This will not be evaluated by the organizers this year, but it will serve as a first basis for future evolution of the task.

6. EVALUATION CRITERIA
As in 2012, the official evaluation metric will be the mean average precision computed on the N top-ranked violent shots. Several performance measures will be used for diagnostic purposes: false alarm and missed detection rates, AED precision and recall as defined in [4], the MediaEval cost (a function weighting false alarms (FA) and missed detections (MI)), etc. To avoid evaluating systems only at given operating points and to enable a full comparison of the pros and cons of each system, we will use detection error trade-off (DET) curves, plotting Pfa as a function of Pmiss given a segmentation and a score for each segment, where the higher the score, the more likely the violence. Pfa and Pmiss are, respectively, the FA and MI rates given the system's output and the reference annotation. For the shot-classification runs, the FA and MI rates are calculated on a per-shot basis, while for the segment-level runs they are computed on a per-unit-of-time basis, i.e., the durations of the reference and detected segments are compared. Note that for the segment-level runs, DET curves are only possible for systems returning a dense segmentation (a list of segments that spans the entire video); segments not in the output list will be considered non-violent for all thresholds.
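To illustrate the measures above for the shot-classification case, the following is a minimal sketch of Pfa and Pmiss at a given score threshold, of the (Pfa, Pmiss) points of a DET curve obtained by sweeping the threshold, and of average precision over the N top-ranked shots; the official mean average precision would average the latter over the test movies. This is not the official scoring tool: the input representation (parallel lists of per-shot scores and 0/1 reference labels) and the average-precision normalization are assumptions made for illustration.

```python
# Minimal sketch of the shot-level diagnostic measures described above.
# Not the official evaluation tool; normalization choices are illustrative.

from typing import List, Tuple


def pfa_pmiss(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """FA rate over non-violent shots and MI rate over violent shots,
    where a shot is declared violent when its score >= threshold."""
    fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    mi = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    n_neg = labels.count(0)
    n_pos = labels.count(1)
    return fa / max(n_neg, 1), mi / max(n_pos, 1)


def det_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """(Pfa, Pmiss) pairs obtained by using every observed score as a threshold."""
    return [pfa_pmiss(scores, labels, t) for t in sorted(set(scores), reverse=True)]


def average_precision_at_n(scores: List[float], labels: List[int], n: int = 100) -> float:
    """Average precision over the n top-ranked shots (ranked by decreasing score).
    Normalized here by the number of violent shots retrieved in the top n,
    which is one possible convention."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])][:n]
    hits, precisions = 0, []
    for i, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)
```

For the segment-level runs, the same rates would instead be accumulated over durations, comparing the time covered by the reference segments with the time covered by the detected segments.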
7. REFERENCES
[1] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In ICMR, pages 215–222, 2013.
[2] C. Penet, C.-H. Demarty, G. Gravier, and P. Gros. Multimodal information fusion and temporal integration for violence detection in movies. In ICASSP, Kyoto, Japan, 2012.
[3] F. D. M. d. Souza, G. C. Chavez, E. A. d. Valle Jr., and A. d. A. Araujo. Violence detection in video using spatio-temporal features. In SIBGRAPI '10, pages 224–230, Washington, DC, USA, 2010.
[4] A. Temko, C. Nadeu, and J.-I. Biel. Acoustic event detection: SVM-based system and evaluation setup in CLEAR'07. In Multimodal Technologies for Perception of Humans, pages 354–363, 2008.