<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The MediaEval 2014 Affect Task: Violent Scenes Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mats Sjöberg</string-name>
          <email>mats.sjoberg@aalto</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vu Lam Quang</string-name>
          <email>lamquangvu@gmail.com</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu-Gang Jiang</string-name>
          <email>yugang.jiang@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire-Hélène Demarty</string-name>
          <email>claire-helene.demarty@technicolor.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fudan University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Johannes Kepler University</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technicolor</institution>
          ,
          <addr-line>Rennes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <addr-line>Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCMC</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper provides a description of the MediaEval 2014 Affect Task: Violent Scenes Detection, which is running for the fourth year. The task originates from a use case at Technicolor that aims to help users find suitable content in a movie database. We provide insights on the use case, the task challenges, the data set and ground truth, the required and optional participant runs, and the evaluation metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The Affect Task: Violent Scenes Detection is part of the
MediaEval 2014 Benchmarking Initiative for Multimedia
Evaluation. The objective of the task is to automatically detect
violent segments in movies. The challenge is proposed for
the fourth year in the MediaEval benchmark. It derives
from a use case at Technicolor (http://www.technicolor.com/) that involves helping
parents choose movies that are suitable for their children with
respect to their violent content. Parents decide to select
or reject movies after previewing the most violent parts of
the movies.</p>
      <p>
        In the literature, the detection of violence in movies had, until recently, been addressed only marginally [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. As most of the proposed methods suffer from a lack of consistent
evaluation, usually relying on constrained and
closed datasets, the task's main objective is to propose a
public common evaluation framework for research in this
area.
      </p>
      <p>This year we concentrate on a subjective definition of
violence that is closer to the considered use case than the more
objective definition used in the previous editions. Another
novelty is the addition of a new generalization task which
transposes the detection to short web video footage. The
idea is to assess how well approaches generalize to kinds of
video material other than typical Hollywood movies.
User-generated videos shared via on-line video platforms have
been strongly gaining in popularity during the past couple of
years. Taking such material into account is of vital interest
for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
      <p>The task requires participants to deploy multimedia
features to automatically detect movie segments that contain
violent material. Segments are regarded as video time intervals of
arbitrary length, e.g., start frame to end frame. In contrast
to previous years, a video shot segmentation is no longer
provided. Violence is defined as content which "one would
not let an 8-year-old child see in a movie because it
contains physical violence". To solve the task, participants are
allowed either to use only features extracted from the
original movie DVDs, or to also use additional external data,
e.g., extracted from the web.</p>
    </sec>
    <sec id="sec-3">
      <title>3. DATA DESCRIPTION</title>
      <p>Two different data sets are proposed: (i) a set of 31
Hollywood movies, whose DVDs must be purchased by the
participants, for the main task, and (ii) a set of 86 short YouTube
web videos under Creative Commons licenses that allow
redistribution, for the generalization task.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Hollywood Movies</title>
      <p>The proposed movies are of different genres and show
different amounts of violence, ranging from extremely violent movies
to movies without violence. From the DVDs, participants
can extract various information from different modalities,
namely: visual (frames), audio (soundtracks) and text
(subtitles and any additional metadata present on the DVDs).</p>
      <p>Of these 31 movies, 24 are dedicated to the training
process: "Armageddon", "Billy Elliot", "Eragon", "Harry Potter
5", "I am Legend", "Leon", "Midnight Express", "Pirates of
the Caribbean 1", "Reservoir Dogs", "Saving Private Ryan",
"The Sixth Sense", "The Wicker Man", "The Bourne
Identity", "The Wizard of Oz", "Dead Poets Society", "Fight Club",
"Independence Day", "Fantastic Four 1", "Fargo", "Forrest
Gump", "Legally Blond", "Pulp Fiction", "The God Father 1"
and "The Pianist". The remaining 7 movies, "8 Mile",
"Braveheart", "Desperado", "Ghost in the Shell", "Jumanji",
"Terminator 2" and "V for Vendetta", will serve as the test set
for the actual benchmarking.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Web Videos</title>
      <p>For the generalization task, we gathered 86 videos from
YouTube (http://www.youtube.com/), which are indicated by their uploaders to fall under a
Creative Commons license (total duration ca. 157 minutes).
They vary in length between 6 seconds and 6 minutes. The
dataset contains both violent and non-violent videos from
very diverse categories: video games, amateur videos of
accidents, sports events, etc. Videos were retrieved with search
queries reflecting violence, such as "killing video games" or
"brutal accident". The results were then filtered for Creative
Commons licenses and short duration, and the final videos were
manually selected from the remaining results. Along with
the actual videos, we provide participants with a variety
of metadata from YouTube, including the YouTube ID, upload
date, title, description, keywords, duration, view counts,
ratings, likes and dislikes.</p>
      <p>This kind of video material is particularly challenging due
to factors such as generally lower quality than Hollywood movies,
the presence of different languages,
overlay text, black framing of the actual frames, and other
modifications of the raw video content.</p>
    </sec>
    <sec id="sec-6">
      <title>4. GROUND TRUTH</title>
      <p>
        This year the ground truth (for the test set and the generalization
task) was created by several human assessors who followed
the subjective definition of violence introduced in Section 2.
The training data annotations (i.e., the 24 movies) are the ones
from the previous edition of the task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This year's
annotations followed this protocol. Firstly, all the
videos were annotated separately by two groups of
annotators from two different countries. In each group, regular
annotators labeled all the videos, which were then reviewed
by master annotators. Regular annotators were graduate
students (typically single with no children) and master
annotators were senior researchers (typically married with
children). No discussions were held between annotators during
the annotation process. Group 1 used 2 regular annotators
and 1 master annotator; Group 2 used 5 regular annotators
and 3 master annotators. Annotators labeled different sets
of movies. In the end, each movie received 2 different
annotations, which were then merged by the master annotators.
Secondly, the resulting annotations from the two groups were
merged and reviewed once more by the task organizers. All
the uncertain, e.g., borderline, cases were resolved via panel
discussions involving different people from different
countries, to avoid cultural bias in the annotations. A textual
description was added to each segment to reflect the choices
of the annotators. Each annotated violent segment contains
only one action, whenever possible. In the cases where
different actions overlap, the whole segment is
provided with the different actions; this is indicated in the
annotation files by the tag "multiple action scene". Each
violent segment is annotated at the frame level, i.e., it is defined
by its starting and ending video frame numbers.
      </p>
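      <p>The merge step above can be pictured as an interval union over frame-level segments. The following is a minimal sketch in Python; the segment values are hypothetical and the automatic union only illustrates the mechanical part of the merge, since in the actual protocol borderline cases were resolved manually by the master annotators.</p>

```python
def merge_segments(annotations):
    """Merge frame-level segments from several annotators into a union
    of non-overlapping (start_frame, end_frame) intervals."""
    segs = sorted(s for ann in annotations for s in ann)
    merged = []
    for start, end in segs:
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent intervals: extend the previous one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two hypothetical annotation groups on the same movie:
group1 = [(100, 250), (400, 480)]
group2 = [(120, 300), (700, 760)]
print(merge_segments([group1, group2]))  # [(100, 300), (400, 480), (700, 760)]
```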
      <p>
        In addition to the segments containing physical violence, the
annotations also include high-level concepts for the visual and
audio modalities of the first 17 Hollywood movies in the
training set. Seven visual concepts ("presence of blood",
"fights", "presence of fire", "presence of guns", "presence of
cold weapons", "car chases" and "gory scenes") and three
audio concepts ("presence of screams", "gunshots" and
"explosions") are provided. These are the concepts proposed in
the previous editions of the task, see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>5. RUN DESCRIPTION</title>
      <p>This year, there are two subtasks: (i) the main task and
(ii) the generalization task. In the main task, participants are
required to detect violence in the 7 Hollywood movies which
serve as the test set. In the generalization task, participants
are expected to use the same systems as for the main task,
but this time to detect violence in the 86 YouTube videos
provided by the organizers (the annotations were made available
by Fudan University, Vietnam University of Science, and Technicolor;
any publication using these data should acknowledge these
institutions). The training data is the same
for both subtasks.</p>
      <p>Participants can submit two types of runs for each
subtask: runs generated using the official training data only, or runs
that also use external sources (e.g., the Internet). In all runs, participants
are required to provide the detected violent segments by specifying
the starting and ending time of each segment together with
a confidence score (the higher the value, the more likely
the segment is violent).</p>
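      <p>As an illustration only, a run entry of this kind could be serialized as one whitespace-separated line per segment. The field layout below (video identifier, start and end time in seconds, confidence) is a hypothetical sketch, not the official submission format, which is distributed by the task organizers.</p>

```python
# Hypothetical run-file line: <video_id> <start_s> <end_s> <confidence>
def format_run_line(video_id, start, end, confidence):
    """Serialize one detected violent segment as a run-file line."""
    return f"{video_id} {start:.2f} {end:.2f} {confidence:.4f}"

def parse_run_line(line):
    """Parse a run-file line back into typed fields."""
    vid, start, end, conf = line.split()
    return vid, float(start), float(end), float(conf)

line = format_run_line("8_Mile", 612.0, 645.5, 0.87)
print(line)                   # 8_Mile 612.00 645.50 0.8700
print(parse_run_line(line))   # ('8_Mile', 612.0, 645.5, 0.87)
```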
    </sec>
    <sec id="sec-8">
      <title>6. EVALUATION CRITERIA</title>
      <p>The official evaluation metric is the Mean Average
Precision (MAP). In addition, for comparison purposes,
metrics from the previous editions of the task will be
computed as well, e.g., false alarm and miss detection rates,
AED precision and recall, and the MediaEval cost, which is a
function weighting false alarms (FA) and missed detections
(MI). To avoid evaluating systems only at a single
operating point and to enable a full comparison of the pros and
cons of each system, we use detection error trade-off (DET)
curves, plotting the false rejection rate as a function of the false
positive rate, given a violence confidence score for each
segment. The false rejection and false positive rates are calculated
on a per-unit-of-time basis, i.e., the durations of the reference
and detected segments are compared. Segments not in the
output list are considered non-violent.</p>
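      <p>The per-unit-of-time rates can be sketched as follows. This is a minimal illustration, assuming each segment list contains non-overlapping (start, end) intervals in seconds and that the video duration is known; the official scoring tool may differ in details such as segment normalization.</p>

```python
def duration(segs):
    """Total duration (seconds) of a list of non-overlapping segments."""
    return sum(e - s for s, e in segs)

def total_overlap(a_segs, b_segs):
    """Summed temporal intersection between two segment lists."""
    return sum(max(0.0, min(ae, be) - max(as_, bs))
               for as_, ae in a_segs for bs, be in b_segs)

def fa_miss_rates(reference, detected, video_len):
    """Duration-based false alarm and miss rates.

    Miss rate: reference violent time not covered by any detection,
    over total reference violent time.  False alarm rate: detected
    time outside the reference, over total non-violent time.
    """
    hit = total_overlap(reference, detected)
    miss_rate = (duration(reference) - hit) / duration(reference)
    fa_rate = (duration(detected) - hit) / (video_len - duration(reference))
    return fa_rate, miss_rate

# Hypothetical 100 s clip: 20 s of reference violence, 25 s detected.
ref = [(10.0, 30.0)]
det = [(15.0, 40.0)]
print(fa_miss_rates(ref, det, 100.0))  # (0.125, 0.25)
```

Sweeping a threshold over the per-segment confidence scores and recomputing these two rates at each operating point yields the points of the DET curve described above.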
    </sec>
    <sec id="sec-9">
      <title>7. CONCLUSIONS</title>
      <p>The Affect Task: Violent Scenes Detection provides
participants with a comparative and collaborative evaluation
framework for violence detection in movies. This year in
particular, the task also explores the generalization of such
systems to web footage. Details on the methods and results
of each individual team can be found in the working note
papers of the participating teams in these proceedings.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>This task is supported by the following projects: Academy of
Finland grants 255745 and 251170, UEFISCDI SCOUTER
grant 28DPST/30-08-2013, National Natural Science
Foundation of China grants 61201387 and 61228205, China's
National 973 Program 2010CB327900, Vietnam National
University Ho Chi Minh City grant B2013-26-01, Austrian
Science Fund P25655 and EU FP7-ICT-2011-9 project 601166.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Cedric</given-names>
            <surname>Penet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Claire-Helene</given-names>
            <surname>Demarty</surname>
          </string-name>
          , Guillaume Gravier, and Patrick Gros, "
          <article-title>Multimodal Information Fusion and Temporal Integration for Violence Detection in Movies</article-title>
          ",
          <source>IEEE ICASSP</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Jan Schluter, Ionut Mironica, and Markus Schedl, \
          <article-title>A Naive Mid-level Concept-based Fusion Approach to Violence Detection in Hollywood Movies"</article-title>
          ,
          <source>ACM ICMR</source>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Esra</given-names>
            <surname>Acar</surname>
          </string-name>
          , Frank Hopfgartner, and Sahin Albayrak, "
          <article-title>Violence Detection in Hollywood Movies by the Fusion of Visual and Mid-level Audio Cues</article-title>
          ",
          <source>ACM Multimedia</source>
          , pp.
          <fpage>717</fpage>
          -
          <lpage>720</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Claire-Helene</given-names>
            <surname>Demarty</surname>
          </string-name>
          , Cedric Penet, Markus Schedl, Bogdan Ionescu, Vu Lam Quang, and
          <string-name>
            <given-names>Yu-Gang</given-names>
            <surname>Jiang</surname>
          </string-name>
          , "
          <article-title>The MediaEval 2013 Affect Task: Violent Scenes Detection</article-title>
          ",
          <source>CEUR-WS</source>
          , Vol.
          <volume>1043</volume>
          , http://ceur-ws.org/Vol-1043/mediaeval2013_submission_4.pdf, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>