The MediaEval 2013 Affect Task: Violent Scenes Detection*

Claire-Hélène Demarty, Technicolor, Rennes, France (claire-helene.demarty@technicolor.com)
Cédric Penet, Technicolor, Rennes, France (cedric.penet@technicolor.com)
Markus Schedl, Johannes Kepler University, Linz, Austria (markus.schedl@jku.at)
Bogdan Ionescu, University Politehnica of Bucharest, Romania (bionescu@imag.pub.ro)
Vu Lam Quang, University of Science, VNU-HCMC, Vietnam (lamquangvu@gmail.com)
Yu-Gang Jiang, Fudan University, China (yugang.jiang@gmail.com)

* This year, work has been supported, in part, by the Quaero Program (http://www.quaero.org/).

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 17-19, 2013, Barcelona, Spain.

ABSTRACT
This paper provides a description of the MediaEval 2013 Affect Task: Violent Scenes Detection. The task, proposed for the third year to the research community, derives directly from a Technicolor use case which aims at easing a user's selection process from a movie database; it therefore applies to movie content. We provide some insight into the Technicolor use case before giving details on the task itself, which has seen some changes in 2013. The dataset, annotations, and evaluation criteria, as well as the required and optional runs, are described.

1. INTRODUCTION
The Affect Task: Violent Scenes Detection is part of the MediaEval 2013 benchmarking initiative for multimedia evaluation. The objective is to automatically detect violent segments in movies. The challenge is proposed for the third year in the MediaEval benchmark. It derives from a use case at Technicolor (http://www.technicolor.com), which involves helping users choose movies that are suitable for the children in their family. The movies should be suitable in terms of their violent content, e.g., for viewing by the users' families. Users select or reject movies by previewing the parts of the movies (i.e., scenes or segments) that include the most violent moments. In the literature, the detection of violence had received little attention [2, 1, 3] until it recently gained interest. As most of the proposed methods suffer from the lack of a common and consistent database and usually use a limited development set, the task was launched to propose a public and common framework for the research community. This year, among other changes, two definitions of violence are studied, an objective one and a subjective one (see below). The addition of a subjective definition was motivated by the fact that the 2012 definition proved to lead to annotations which do not correspond to the use case.

2. TASK DESCRIPTION
The task requires participants to deploy multimodal features to automatically detect portions of movies containing violent material. For 2013, two definitions of violence are studied. The first one corresponds to the one used in previous years and was chosen to be as objective as possible: violence is defined as "physical violence or accident resulting in human injury or pain". In an attempt to better fit the use case, a second definition is proposed, according to which the events of interest are "those which one would not let an 8-year-old child see, because they contain physical violence". This year, contrary to the previous challenges, the different runs alternatively allow participants to use either only features extracted from the provided DVDs, or also additional external data (e.g., extracted from the web).
3. DATA DESCRIPTION
In line with the use case, the corpus developed for the task is a set of 25 Hollywood movies that must be purchased as DVDs by the participants. The movies are of different genres and show different amounts of violence (from extremely violent movies to movies without violence). The content extractable from the DVDs consists of information from different modalities, namely visual information, audio signals and subtitles, as well as any additional metadata present on the DVDs. Of these 25 movies, 18 are dedicated to the training process: Armageddon, Billy Elliot, Eragon, Harry Potter 5, I am Legend, Leon, Midnight Express, Pirates of the Caribbean 1, Reservoir Dogs, Saving Private Ryan, The Sixth Sense, The Wicker Man, Kill Bill 1, The Bourne Identity, The Wizard of Oz, Dead Poets Society, Fight Club and Independence Day. The remaining 7 movies, Fantastic Four, Fargo, Forrest Gump, Legally Blonde, Pulp Fiction, The Godfather 1 and The Pianist, will serve as the evaluation set. As in 2011 and 2012, we tried to preserve the distribution of genres (from extremely violent to non-violent) in both the training and evaluation sets.

4. GROUND TRUTH
The ground truth¹ was created by several human assessors. In addition to segments containing physical violence (according to the two definitions above), the annotations also include high-level concepts for the visual and audio modalities. Each annotated violent segment contains only one action, whenever possible. In cases where different actions overlap, the whole segment is provided with the different actions; this is indicated in the annotation files by the tag "multiple action scene". Each violent segment is annotated at frame level, i.e., it is defined by its starting and ending video frame numbers.

¹ The annotations, shot detections and key frames for this task were made available by Fudan University, the Vietnam University of Science, and Technicolor. Any publication using these data should acknowledge these institutions' contributions.

Seven visual and three audio concepts are provided: presence of blood, fights, presence of fire, presence of guns, presence of cold weapons, car chases and gory scenes (for the video modality); presence of screams, gunshots and explosions (for the audio modality). Participants should note that they are welcome to carry out detection of these high-level concepts themselves. However, concept detection is not the goal of the task, and the high-level concept annotations are only provided for training purposes and only on the training set. Each video concept follows the same annotation format as the violent segments, i.e., starting and ending frame numbers and possibly some additional tags. Regarding blood annotations, the proportion of blood in each segment is indicated by one of the following tags: unnoticeable, low, medium and high. Four different types of fights are annotated: only two people fighting, a small group of people (roughly fewer than 10), a large group of people (more than 10), and distant attack (i.e., no real fight, but somebody is shot or attacked at a distance). As for the presence of fire, anything from big fires and explosions to fire coming out of a gun while shooting, a candle, a cigarette lighter, a cigarette, or sparks was annotated; e.g., a space shuttle taking off also generates fire and thus receives a fire label. An additional tag may indicate special colors of the fire (i.e., not yellow or orange). If a segment of video showed the presence of firearms (or cold weapons), it was annotated, for any type of (parts of) guns (or cold weapons) or assimilated arms. By "cold weapon", we mean any weapon that does not involve fire or explosions resulting from the use of gunpowder or other explosive materials. Annotations of gory scenes are more delicate. In the present task, they are indicated by graphic images of bloodletting and/or tissue damage, which includes horror or war representations. As this is also a subjective and difficult notion to define, some additional segments showing particularly disgusting mutants or creatures are also annotated as gore; in this case, additional tags describing the event/scene are added. For the audio concepts, each temporal segment is annotated with its starting and ending times in seconds, and with an additional tag corresponding to the type of event, chosen from the list: nothing, gunshot, canon fire, scream, scream effort, explosion, multiple actions, multiple actions canon fire, multiple actions scream effort. Automatically generated shot boundaries with their corresponding key frames are also provided with each movie. Shot segmentation was carried out by Technicolor's software.
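To make the annotation format above concrete, the following is a minimal sketch of how a participant might load frame-level violent-segment annotations. The exact file layout distributed with the task is not specified in this paper, so the assumed one-segment-per-line format ("<start_frame> <end_frame> [tags...]") and the helper names (ViolentSegment, load_segments, frame_is_violent) are illustrative assumptions, not the official tools.

```python
# Minimal sketch of reading frame-level violent-segment annotations.
# The column layout assumed here (start frame, end frame, optional tags)
# is an illustration only; consult the files distributed by the organizers.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ViolentSegment:
    start_frame: int                                  # first frame of the segment
    end_frame: int                                    # last frame of the segment
    tags: List[str] = field(default_factory=list)     # e.g. "multiple action scene"


def load_segments(path: str) -> List[ViolentSegment]:
    """Parse a (hypothetical) annotation file into ViolentSegment objects."""
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            start, end = int(parts[0]), int(parts[1])
            segments.append(ViolentSegment(start, end, parts[2:]))
    return segments


def frame_is_violent(frame: int, segments: List[ViolentSegment]) -> bool:
    """True if a frame falls inside any annotated violent segment."""
    return any(s.start_frame <= frame <= s.end_frame for s in segments)
```

Audio-concept annotations could be handled analogously, with starting and ending times in seconds and a single event-type tag in place of frame numbers.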
5. RUN DESCRIPTION
Participants can submit four types of runs: two of them are shot-classification runs and the other two are segment-level runs. For the two shot-classification runs, participants are required to provide violent-scene detection at the shot level, according to the provided shot boundaries. Each shot has to be classified as violent or non-violent, with a confidence score. These two runs differ in the data that can be used for the classification: in the first one, only the content of the movie extractable from the DVDs is allowed for feature extraction, whereas in the second one, additional external data (e.g., extracted from the web) can also be used. For the two segment-level runs, participants are required to provide violent segments for each test movie, independently of the shot boundaries. Once again, confidence scores should be added for each segment. Similarly to the shot-level runs, the two segment-level runs differ in the type of data allowed for the classification: internal data from the DVDs only vs. internal data plus additional external data. In all cases, confidence scores are compulsory, as they will be used for the evaluation metric. They will also allow detection error trade-off curves to be plotted, which should be of great interest for analyzing and comparing the different techniques. For both subtasks, i.e., both violence definitions, the required run is the shot-level run without external data.

As a first step towards a qualitative evaluation, participants are encouraged to present at the MediaEval workshop a video summary of the most violent scenes found by their algorithms. This will not be evaluated by the organizers this year, but it will serve as a first basis for future evolution of the task.

6. EVALUATION CRITERIA
As in 2012, the official evaluation metric will be the mean average precision computed on the N top-ranked violent shots. Several performance measures will be used for diagnostic purposes: false alarm and missed detection rates, AED precision and recall as defined in [4], the MediaEval cost (a function weighting false alarms (FA) and missed detections (MI)), etc. To avoid evaluating systems only at given operating points and to enable a full comparison of the pros and cons of each system, we will use detection error trade-off (DET) curves, plotting Pfa as a function of Pmiss given a segmentation and a score for each segment, where the higher the score, the more likely the violence. Pfa and Pmiss are, respectively, the FA and MI rates given the system's output and the reference annotation. For the shot-classification runs, the FA and MI rates are calculated on a per-shot basis, while for the segment-level runs they are computed on a per-unit-of-time basis, i.e., the durations of the reference and detected segments are compared. Note that for the segment-level runs, DET curves are only possible for systems returning a dense segmentation (a list of segments that spans the entire video); segments not in the output list will be considered non-violent for all thresholds.
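To illustrate the measures above for the shot-classification case, the following is a minimal sketch of Pfa and Pmiss at a given score threshold, of the (Pfa, Pmiss) points of a DET curve obtained by sweeping the threshold, and of average precision over the N top-ranked shots; the official mean average precision would average the latter over the test movies. This is not the official scoring tool: the input representation (parallel lists of per-shot scores and 0/1 reference labels) and the average-precision normalization are assumptions made for illustration.

```python
# Minimal sketch of the shot-level diagnostic measures described above.
# Not the official evaluation tool; normalization choices are illustrative.

from typing import List, Tuple


def pfa_pmiss(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """FA rate over non-violent shots and MI rate over violent shots,
    where a shot is declared violent when its score >= threshold."""
    fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    mi = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    n_neg = labels.count(0)
    n_pos = labels.count(1)
    return fa / max(n_neg, 1), mi / max(n_pos, 1)


def det_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """(Pfa, Pmiss) pairs obtained by using every observed score as a threshold."""
    return [pfa_pmiss(scores, labels, t) for t in sorted(set(scores), reverse=True)]


def average_precision_at_n(scores: List[float], labels: List[int], n: int = 100) -> float:
    """Average precision over the n top-ranked shots (ranked by decreasing score).
    Normalized here by the number of violent shots retrieved in the top n,
    which is one possible convention."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])][:n]
    hits, precisions = 0, []
    for i, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)
```

For the segment-level runs, the same rates would instead be accumulated over durations, comparing the time covered by the reference segments with the time covered by the detected segments.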
7. REFERENCES
[1] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In ICMR, pages 215–222, 2013.
[2] C. Penet, C.-H. Demarty, G. Gravier, and P. Gros. Multimodal information fusion and temporal integration for violence detection in movies. In ICASSP, Kyoto, Japan, 2012.
[3] F. D. M. d. Souza, G. C. Chavez, E. A. d. Valle Jr., and A. d. A. Araujo. Violence detection in video using spatio-temporal features. In SIBGRAPI '10, pages 224–230, Washington, DC, USA, 2010.
[4] A. Temko, C. Nadeu, and J.-I. Biel. Acoustic event detection: SVM-based system and evaluation setup in CLEAR'07. In Multimodal Technologies for Perception of Humans, pages 354–363, 2008.