<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>The MediaEval 2013 Affect Task: Violent Scenes Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claire-Hélène Demarty</string-name>
<email>claire-helene.demarty@technicolor.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cédric Penet</string-name>
          <email>cedric.penet@technicolor.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vu Lam Quang</string-name>
          <email>lamquangvu@gmail.com</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu-Gang Jiang</string-name>
          <email>yugang.jiang@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fudan University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Johannes Kepler University</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technicolor</institution>
          ,
          <addr-line>Rennes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technicolor</institution>
          ,
          <addr-line>Rennes, France, claire</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University Polytehnica of</institution>
          ,
          <addr-line>Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCMC</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>17</fpage>
      <lpage>19</lpage>
      <abstract>
<p>This paper provides a description of the MediaEval 2013 Affect Task: Violent Scenes Detection. The task, proposed to the research community for the third year, derives directly from a Technicolor use case that aims at easing a user's selection process in a movie database. The task therefore applies to movie content. We provide some insight into the Technicolor use case before detailing the task itself, which has seen some changes in 2013. The dataset, annotations, and evaluation criteria, as well as the required and optional runs, are described.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>4. GROUND TRUTH</title>
<p>The ground truth was created by several human assessors. The annotations, shot detections, and key frames for this task were made available by Fudan University, the Vietnam University of Science, and Technicolor; any publication using these data should acknowledge these institutions’ contributions. In addition to segments containing physical violence (according to the two definitions above), the annotations also include high-level concepts for the visual and audio modalities. Each annotated violent segment contains only one action whenever possible. Where different actions overlap, the whole segment is provided with the different actions, indicated in the annotation files by the tag “multiple action scene”. Each violent segment is annotated at the frame level, i.e., it is defined by its starting and ending video frame numbers.</p>
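      <p>For readers who want to work with the segment annotations programmatically, the following is a minimal sketch of a parser, assuming a simple whitespace-separated line format of starting frame, ending frame, and optional tags; the exact file syntax is defined by the released annotation files, so this layout is only an illustration.</p>
      <preformat>
# Hypothetical sketch: parsing frame-level violent-segment annotations.
# Assumed line format (illustration only): "start_frame end_frame [tag ...]",
# e.g. "1200 1480 multiple_action_scene".
from dataclasses import dataclass, field

@dataclass
class ViolentSegment:
    start_frame: int                           # first frame of the segment
    end_frame: int                             # last frame of the segment
    tags: list = field(default_factory=list)   # e.g. ["multiple_action_scene"]

def load_segments(path):
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:                # skip empty/malformed lines
                start, end = int(parts[0]), int(parts[1])
                segments.append(ViolentSegment(start, end, parts[2:]))
    return segments
</preformat>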
<p>Seven visual and three audio concepts are provided: presence
of blood, fights, presence of fire, presence of guns, presence
of cold weapons, car chases, and gory scenes (for the video
modality); presence of screams, gunshots, and explosions (for
the audio modality). Participants are welcome to carry out
detection of these high-level concepts themselves. However,
concept detection is not the goal of the task, and the high-level
concept annotations are provided for training purposes only,
and only on the training set. Each video concept follows the
same annotation format as the violent segments, i.e., starting
and ending frame numbers and possibly some additional tags.
For blood annotations, the proportion of blood in each segment
is indicated by one of the following tags: unnoticeable, low,
medium, or high. Four different types of fights are annotated:
only two people fighting, a small group of people (roughly fewer
than 10), a large group of people (more than 10), and distant
attack (i.e., no real fight, but somebody is shot or attacked at
a distance). As for the presence of fire, anything from big fires
and explosions to fire coming out of a gun while shooting, a
candle, a cigarette lighter, a cigarette, or sparks was annotated;
e.g., a space shuttle taking off also generates fire and thus
receives a fire label. An additional tag may indicate special
colors of the fire (i.e., not yellow or orange). Segments showing
any type of firearm or cold weapon (or parts thereof), or
assimilated arms, were annotated accordingly. By “cold weapon”,
we mean any weapon that does not involve fire or explosions
resulting from the use of gunpowder or other explosive materials.
Annotations of gory scenes are more delicate. In the present
task, they are indicated by graphic images of bloodletting and/or
tissue damage, including horror or war representations. As this
is a subjective and difficult notion to define, some additional
segments showing particularly disgusting mutants or creatures
are also annotated as gore; in this case, additional tags
describing the event/scene are added. For the audio concepts,
each temporal segment is annotated with its starting and ending
times in seconds and an additional tag corresponding to the type
of event, chosen from the list: nothing, gunshot, canon fire,
scream, scream effort, explosion, multiple actions, multiple
actions canon fire, multiple actions scream effort. Automatically
generated shot boundaries with their corresponding key frames are
also provided for each movie. Shot segmentation was carried out
by Technicolor’s software.</p>
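      <p>Because the video concepts are annotated with frame numbers while the audio concepts are annotated in seconds, comparing the two requires a common timeline. Below is a minimal sketch of such a conversion; the 25 fps frame rate is an assumption (typical for PAL DVDs) and must be replaced by the actual rate of each movie.</p>
      <preformat>
# Minimal sketch: aligning frame-based video annotations with second-based
# audio annotations. The 25 fps value is an assumption, not a task constant.
def frames_to_seconds(start_frame, end_frame, fps=25.0):
    return start_frame / fps, end_frame / fps

v_start, v_end = frames_to_seconds(1200, 1480)   # video concept, in seconds
a_start, a_end = 50.0, 55.5                      # e.g. an audio "scream" segment

# Temporal overlap between the two segments, in seconds.
overlap = max(0.0, min(v_end, a_end) - max(v_start, a_start))
print(f"video/audio overlap: {overlap:.1f} s")   # prints 5.5 s
</preformat>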
    </sec>
    <sec id="sec-2">
      <title>5. RUN DESCRIPTION</title>
<p>Participants can submit four types of runs: two
shot-classification runs and two segment-level runs. For the two
shot-classification runs, participants are required to provide
violent scene detection at the shot level, according to the
provided shot boundaries. Each shot has to be classified as
violent or non-violent, with a confidence score. The two runs
differ in the data that can be used for classification: for the
first, only content extractable from the movie DVDs is allowed
for feature extraction, whereas for the second, additional
external data (e.g., extracted from the web) can be used. For
the two segment-level runs, participants are required to provide
violent segments for each test movie, independently of the shot
boundaries. Once again, a confidence score should be attached to
each segment. Similarly to the shot-level runs, the two
segment-level runs differ in the type of data allowed for
classification: internal data from the DVDs only vs. internal
plus additional external data. In all cases, confidence scores
are compulsory, as they will be used for the evaluation metric.
They will also allow plotting detection error trade-off curves,
which should be of great interest for analyzing and comparing
the different techniques. For both subtasks, i.e., both violence
definitions, the required run is the shot-level run without
external data.</p>
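      <p>To make the run structure concrete, here is a minimal sketch of writing a shot-level run file with one classified shot per line; the field layout (movie identifier, shot boundaries, binary decision, confidence score) and the 0.5 decision threshold are illustrative assumptions, since the official submission syntax is specified by the task organizers.</p>
      <preformat>
# Hypothetical sketch of a shot-level run file writer. The line format and
# the 0.5 decision threshold are assumptions for illustration only.
def write_run(path, shot_results):
    """shot_results: iterable of (movie_id, start_frame, end_frame, score)."""
    with open(path, "w") as f:
        for movie_id, start, end, score in shot_results:
            decision = 1 if score >= 0.5 else 0    # violent vs. non-violent
            f.write(f"{movie_id} {start} {end} {decision} {score:.4f}\n")

write_run("shot_run_internal.txt",
          [("movie_A", 0, 120, 0.91),      # confident violent shot
           ("movie_A", 121, 250, 0.12)])   # confident non-violent shot
</preformat>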
      <p>As a first step towards a qualitative evaluation, participants
are encouraged to present at the MediaEval workshop a
video summary of the most violent scenes found by their
algorithms. This will not be evaluated by the organizers this
year, but it will serve as a first basis for future evolution of
the task.</p>
    </sec>
    <sec id="sec-3">
      <title>6. EVALUATION CRITERIA</title>
      <p>
        As in 2012, the official evaluation metric will be the mean
average precision at the N top-ranked violent shots. Several
performance measures will be used for diagnostic purposes:
false alarm and missed detection rates, AED precision and
recall as defined in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the MediaEval cost, which is a function weighting
false alarms (FA) and missed detections (MI), etc. To avoid
evaluating systems only at given operating points and to enable
a full comparison of the pros and cons of each system, we will
use detection error trade-off (DET) curves, plotting Pfa as a
function of Pmiss given a segmentation and a score for each
segment, where the higher the score, the more likely the
violence. Pfa and Pmiss are, respectively, the FA and MI rates
given the system’s output and the reference annotation. In the
shot-classification runs, the FA and MI rates are calculated on
a per-shot basis, while in the segment-level runs they are
computed on a per-unit-of-time basis, i.e., the durations of
both reference and detected segments are compared. Note that in
the segment-level runs, DET curves are possible only for systems
returning a dense segmentation (a list of segments that spans
the entire video). Segments not in the output list will be
considered non-violent for all thresholds.
      </p>
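      <p>As an illustration of the per-unit-of-time computation, the following minimal sketch derives (Pfa, Pmiss) points for a DET curve from a dense segmentation, where each segment carries its duration in seconds, its reference label, and the system score; the function and data layout are assumptions for illustration, not the official scoring tool.</p>
      <preformat>
# Minimal sketch: DET-curve points on a per-unit-of-time basis, assuming a
# dense segmentation (the whole video is covered by scored segments).
def det_points(segments):
    """segments: list of (duration_s, is_violent_ref, score)."""
    total_viol = sum(d for d, v, _ in segments if v)
    total_nonviol = sum(d for d, v, _ in segments if not v)
    points = []
    for thr in sorted({s for _, _, s in segments}):
        # A segment is detected as violent when its score reaches the threshold.
        missed = sum(d for d, v, s in segments if v and thr > s)
        false_alarm = sum(d for d, v, s in segments if not v and s >= thr)
        points.append((false_alarm / total_nonviol, missed / total_viol))
    return points

# Three segments: 10 s violent, then 5 s and 20 s non-violent.
print(det_points([(10.0, True, 0.9), (5.0, False, 0.7), (20.0, False, 0.2)]))
</preformat>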
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlüter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironica</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          .
          <article-title>A naive mid-level concept-based fusion approach to violence detection in Hollywood movies</article-title>
          .
          <source>In ICMR</source>
          , pages
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Penet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Gros</surname>
          </string-name>
          .
          <article-title>Multimodal Information Fusion and Temporal Integration for Violence Detection in Movies</article-title>
          .
          <source>In ICASSP</source>
          , Kyoto, Japan,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>
          [3]
          <string-name>
            <given-names>F. D. M. d.</given-names>
            <surname>Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Chavez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A. d.</given-names>
            <surname>Valle Jr.</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. d. A.</given-names>
            <surname>Araujo</surname>
          </string-name>
          .
          <article-title>Violence detection in video using spatio-temporal features</article-title>
          .
          <source>In SIBGRAPI '10</source>
          , pages
          <fpage>224</fpage>
          -
          <lpage>230</lpage>
          , Washington, DC, USA,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Temko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nadeu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-I.</given-names>
            <surname>Biel</surname>
          </string-name>
          .
          <article-title>Acoustic Event Detection: SVM-Based System and Evaluation Setup in CLEAR'07</article-title>
          . In Multimodal Technologies for Perception of Humans, pages
          <fpage>354</fpage>
          -
          <lpage>363</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>