<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Technicolor/INRIA team at the MediaEval 2013 Violent Scenes Detection Task∗</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cédric Penet</string-name>
          <email>cedric.penet@technicolor.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire-Hélène Demarty</string-name>
          <email>claire-helene.demarty@technicolor.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
          <email>guig@irisa.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Gros</string-name>
          <email>Patrick.Gros@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNRS/IRISA &amp; INRIA Rennes, Campus de Beaulieu</institution>
          ,
          <addr-line>35042 Rennes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technicolor</institution>
          ,
          <addr-line>1 ave de Belle Fontaine, 35510 Cesson-Sévigné</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper presents the work done at Technicolor and INRIA regarding the MediaEval 2013 Violent Scenes Detection task, which aims at detecting violent scenes in movies. We participated in both the objective and the subjective subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. SYSTEM DESCRIPTION</title>
      <p>∗ We would like to acknowledge the MediaEval Multimedia Benchmark
http://www.multimediaeval.org/ and in particular the Violent Scenes Detection Task 2013
for providing the data used in this research. This work was partly achieved as part of
the Quaero program, funded by Oseo, the French state agency for innovation.</p>
      <p>
        Each sample is represented using both its own words and the words of the n samples
before and the n samples after. Trained for the detection of screams, gunshots and
explosions, our system has proved to be comparable to the state of the art. For each
sample s<sub>i</sub>, we obtain P(s<sub>i</sub> ∈ c<sub>k</sub>), ∀ k ∈ {gunshots,
explosions, screams, others}, where others corresponds to everything that is not screams,
gunshots or explosions. For more insight, please read [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>Segment-based audio concepts decision:
Once the probabilities have been estimated, decisions are made using two types of methods.
First, a single decision variable is extracted, taking, for each sample, the value of the
class that has the highest probability. Second, in order to allow more flexibility in the
system, four binary decision variables are extracted, one for each class. Each binary
variable is set to one if the probability of the sample belonging to the corresponding
class is higher than 30 %. This allows samples to be detected as belonging to several
classes at once.</p>
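      <p>The two decision schemes can be written compactly. The sketch below (illustrative
Python with a toy probability matrix, not our actual code) shows the single argmax decision
variable and the four binary variables thresholded at 30 %:</p>
      <preformat><![CDATA[
# Illustrative sketch of the two segment-based decision schemes.
import numpy as np

CLASSES = ["gunshots", "explosions", "screams", "others"]

def single_decision(probs):
    """One decision variable per sample: index of the most probable class.
    probs has shape (num_samples, num_classes)."""
    return np.argmax(probs, axis=1)

def binary_decisions(probs, threshold=0.30):
    """Four binary variables, one per class, set to 1 when the class
    probability exceeds the threshold; a sample may therefore be assigned
    to several classes at once."""
    return (probs > threshold).astype(int)

# Toy probabilities for two samples
probs = np.array([[0.60, 0.35, 0.02, 0.03],
                  [0.10, 0.20, 0.25, 0.45]])
print(single_decision(probs))   # [0 3] -> gunshots, others
print(binary_decisions(probs))  # [[1 1 0 0]
                                #  [0 0 0 1]]
]]></preformat>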
      <p>
        Segment-based video concepts/features detectors:
The video features used in this system are the same as those used for run #3 of last
year’s task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: shot length, three color-harmonisation-based features, color coherence,
blood-color proportion, flash detection, fire detection, motion intensity and average
luminance. These features are all aggregated on the same variable segments as for the
audio. For the image-based features, aggregation is performed via averaging.
      </p>
      <p>[Table 1: MAP@100 obtained by runs #1 to #4 on the objective and subjective
subtasks; the reported values are 33.82 %, 12.02 %, 13.17 %, 22.48 %, 12.47 %, 53.59 %,
34.00 %, 30.22 %, 44.79 % and 18.81 %.]</p>
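      <p>As a minimal sketch of this aggregation step, assuming frame-level feature vectors
and a segment label per frame (an interface assumed for the example rather than the exact
one of our system), averaging over variable-length segments can be written as follows:</p>
      <preformat><![CDATA[
# Illustrative sketch: average frame-level video features over the same
# variable-length segments as the audio (interface assumed for the example).
import numpy as np

def aggregate_by_segment(frame_features, segment_ids):
    """frame_features: (num_frames, num_features); segment_ids: (num_frames,)
    integer identifier of the segment each frame belongs to."""
    segments = np.unique(segment_ids)
    return np.stack([frame_features[segment_ids == s].mean(axis=0)
                     for s in segments])

# Toy example: 6 frames with 3 features, grouped into 3 segments
features = np.arange(18, dtype=float).reshape(6, 3)
segments = np.array([0, 0, 1, 1, 1, 2])
print(aggregate_by_segment(features, segments).shape)  # (3, 3)
]]></preformat>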
    </sec>
    <sec id="sec-2">
      <title>3. RUNS SUBMITTED</title>
      <p>
        In this section, we present the runs that we submitted for this year’s task. The
first three runs are based on our new multimodal system, for which several configurations
were chosen using cross-validation experiments for both the objective and the subjective
subtasks. The fourth run corresponds to run #3 of our participation in last year’s
benchmark [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In order to evaluate the stability of this previous system, we re-used the exact
same audio and video models, without retraining the parameters.
      </p>
      <p>Run #1: Audio only
The first run is audio only. Audio concept detection is performed using a context window
of size n = 1, and violence detection using a single decision variable is performed using
a context window of size n = 5. In addition, we trained different classifiers on each
audio feature type, resulting in several audio concept detectors. We then performed late
fusion of these classifiers for violence detection using optimal weights. Only shot-level
runs obtained through shot aggregation have been submitted for the objective subtask,
while both shot and chunk aggregation were used for the subjective one.</p>
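      <p>The exact optimal-weights procedure is not detailed here. As one possible reading,
the following hypothetical Python sketch fuses the violence scores of the per-feature-type
classifiers with a weighted sum and selects the weights on a validation set by a simple
grid search on average precision; the search strategy and the scikit-learn metric are
assumptions made for illustration.</p>
      <preformat><![CDATA[
# Hypothetical sketch of weighted late fusion; the actual optimal-weights
# procedure used in run #1 may differ.
from itertools import product
import numpy as np
from sklearn.metrics import average_precision_score

def fuse(scores, weights):
    """scores: (num_classifiers, num_segments) violence scores;
    weights: one weight per classifier. Returns fused scores."""
    weights = np.asarray(weights, dtype=float)
    return weights @ scores / weights.sum()

def grid_search_weights(scores, labels, steps=5):
    """Toy grid search for fusion weights on a validation set,
    maximising the average precision of the fused scores."""
    best_weights, best_ap = None, -1.0
    grid = np.linspace(0.0, 1.0, steps)
    for candidate in product(grid, repeat=scores.shape[0]):
        if sum(candidate) == 0:
            continue
        ap = average_precision_score(labels, fuse(scores, candidate))
        if ap > best_ap:
            best_weights, best_ap = candidate, ap
    return best_weights, best_ap

# Toy example: two per-feature-type classifiers, five validation segments
scores = np.array([[0.9, 0.2, 0.7, 0.1, 0.4],
                   [0.8, 0.3, 0.2, 0.2, 0.9]])
labels = np.array([1, 0, 1, 0, 1])
print(grid_search_weights(scores, labels))
]]></preformat>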
      <p>Run #2: Multimodal early fusion
This run is an early fusion of audio concepts and video concepts. The audio concepts
provided are those extracted from each type of audio features, using a context window of
size n = 5. Violence detection is then performed using a context window of size n = 5.
For the objective subtask, four audio binary nodes corresponding to the four audio concept
decisions are used per feature type, while only one is used for the subjective subtask.
Only shot aggregation has been submitted for both subtasks.</p>
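      <p>A minimal sketch of the early fusion step, assuming segment-level audio concept
decisions and video feature vectors (the variable names and dimensions are illustrative
only): audio and video observations are simply concatenated into a single vector per
segment before a single classifier is learned.</p>
      <preformat><![CDATA[
# Illustrative sketch: early fusion by concatenating segment-level audio
# concept decisions and video features into one observation vector.
import numpy as np

def early_fusion(audio_decisions, video_features):
    """Both inputs have one row per segment; the classifier itself is
    left out of this sketch."""
    return np.concatenate([audio_decisions, video_features], axis=1)

# Toy example: 3 segments, 4 binary audio-concept nodes, 8 video features
audio = np.random.randint(0, 2, size=(3, 4))
video = np.random.rand(3, 8)
print(early_fusion(audio, video).shape)  # (3, 12)
]]></preformat>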
      <p>Run #3: Multimodal late fusion
This run is equivalent to run #2, the main difference being that late fusion through a
naive Bayesian network is used instead of early fusion. Shot and chunk aggregation have
been submitted for both subtasks.</p>
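      <p>In such a late fusion, the naive Bayesian network combines the per-modality
decisions under a conditional independence assumption. The sketch below is a hypothetical,
self-contained illustration of that combination; the conditional probability tables are
toy values, not those learned by our system.</p>
      <preformat><![CDATA[
# Hypothetical sketch of naive Bayes late fusion of modality decisions.
import numpy as np

def naive_bayes_fusion(decisions, likelihoods, prior):
    """decisions: observed decision value per modality;
    likelihoods: per modality, an array L[v, d] = P(decision d | violence v);
    prior: [P(non-violent), P(violent)].
    Returns the posterior probability of violence for the segment."""
    posterior = np.array(prior, dtype=float)
    for d, lik in zip(decisions, likelihoods):
        posterior *= lik[:, d]  # conditional independence of modalities
    return posterior[1] / posterior.sum()

# Toy conditional probability tables for two modalities (audio, video)
lik_audio = np.array([[0.8, 0.2],   # row 0: P(d | non-violent)
                      [0.3, 0.7]])  # row 1: P(d | violent)
lik_video = np.array([[0.7, 0.3],
                      [0.4, 0.6]])
print(naive_bayes_fusion([1, 1], [lik_audio, lik_video], [0.7, 0.3]))  # 0.75
]]></preformat>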
      <p>Run #4: Last year’s models
This run has only been submitted with shot aggregation for the objective subtask.</p>
    </sec>
    <sec id="sec-3">
      <title>4. RESULTS AND DISCUSSION</title>
      <p>
        The obtained results are reported in Table 1 for each submitted run and both
subtasks. It must first be noted that, compared to last year’s results, where our best
system achieved about 62 % in terms of MAP@100, this year’s results are much lower. Our
best MAP@100 for the objective definition is 33.82 %. Moreover, our best result is
obtained with run #1, and the results we obtained for runs #2, #3 and #4 are very poor in
comparison. This contradicts the results that were previously obtained about
multimodality, especially for run #4, which is a reuse of last year’s models. We think
this is an indication of a flaw in our multimodal protocol. However, this may also
indicate that last year’s results, obtained on a set of only three movies, may have been
overly optimistic.
      </p>
      <p>It must also be noted that the results obtained for the subjective definition are
much higher than for the objective definition, which indicates that the subjective
definition might be less variable than the objective one. Another reason may be that,
globally, the duration of subjective violence in the ground truth is longer than that of
objective violence. It is also interesting that the results obtained at chunk level are
slightly lower than the results obtained at shot level, highlighting the importance of
temporal integration: the more temporal integration the system performs, the better the
results.</p>
      <p>Finally, we obtain good results in terms of recall and precision for most of our
runs and for both violence definitions. For the objective definition, apart from run #4
(the 2012 system), our shot-level runs reach more than 80 % recall and 20 % precision
(runs #2 and #3 even reach recall values of 90 % and 87 % respectively). The subjective
definition yields equivalent recall rates and improved precision rates, up to 30 %, for
runs at shot level.</p>
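      <p>For reference, MAP@100 averages, over the test movies, the average precision
computed on the 100 top-ranked segments of each movie. The sketch below uses one common
definition of AP@k; the official task scorer may normalise differently, and the toy ranked
lists are illustrative only.</p>
      <preformat><![CDATA[
# Illustrative computation of MAP@k; the official MediaEval scorer may
# use a slightly different normalisation.
def average_precision_at_k(ranked_labels, k=100):
    """AP@k over a ranked list of 0/1 relevance labels (1 = violent)."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_labels[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def map_at_k(per_movie_ranked_labels, k=100):
    """Mean of AP@k over the test movies."""
    aps = [average_precision_at_k(labels, k) for labels in per_movie_ranked_labels]
    return sum(aps) / len(aps)

# Toy example with two movies and k = 5
print(map_at_k([[1, 0, 1, 0, 0], [0, 1, 1, 0, 1]], k=5))  # ~0.711
]]></preformat>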
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>C.-H.</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Penet</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Schedl</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>V. L.</given-names> <surname>Quang</surname></string-name>, and
          <string-name><given-names>Y.-G.</given-names> <surname>Jiang</surname></string-name>.
          <article-title>The MediaEval 2013 Affect Task: Violent Scenes Detection</article-title>.
          In <source>MediaEval Workshop</source>,
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schlüter</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Mironică</surname></string-name>, and
          <string-name><given-names>M.</given-names> <surname>Schedl</surname></string-name>.
          <article-title>A Naïve Mid-level Concept-based Fusion Approach to Violence Detection in Hollywood Movies</article-title>.
          In <source>ACM ICMR</source>,
          <year>2013</year>.
        </mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>C.</given-names> <surname>Penet</surname></string-name>,
          <string-name><given-names>C.-H.</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Gravier</surname></string-name>, and
          <string-name><given-names>P.</given-names> <surname>Gros</surname></string-name>.
          <article-title>Audio Event Detection in Movies using Multiple Audio Words and Contextual Bayesian Networks</article-title>.
          In <source>CBMI</source>,
          <year>June 2013</year>.
        </mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>C.</given-names> <surname>Penet</surname></string-name>,
          <string-name><given-names>C.-H.</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Gravier</surname></string-name>, and
          <string-name><given-names>P.</given-names> <surname>Gros</surname></string-name>.
          <article-title>Variability Modelling for Audio Events Detection in Movies</article-title>.
          Submitted to <source>MTAP, Special Issue on CBMI</source>,
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>C.</given-names> <surname>Penet</surname></string-name>,
          <string-name><given-names>C.-H.</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Soleymani</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Gravier</surname></string-name>, and
          <string-name><given-names>P.</given-names> <surname>Gros</surname></string-name>.
          <article-title>Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene Detection Task</article-title>.
          In <source>MediaEval 2012 Workshop</source>,
          <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>J.</given-names> <surname>Schlüter</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Mironică</surname></string-name>, and
          <string-name><given-names>M.</given-names> <surname>Schedl</surname></string-name>.
          <article-title>ARF @ MediaEval 2012: An Uninformed Approach to Violence Detection in Hollywood Movies</article-title>.
          In <source>MediaEval 2012 Workshop</source>,
          <year>2012</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>