<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FAR at MediaEval 2013 Violent Scenes Detection: Concept-based Violent Scenes Detection in Movies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mats Sjöberg</string-name>
          <email>mats.sjoberg@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Schlüter</string-name>
          <email>jan.schlueter@ofai.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Austrian Research Institute for Artificial Intelligence</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Johannes Kepler University</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <addr-line>Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>The MediaEval 2013 Affect Task challenged participants to automatically find violent scenes in a set of popular movies. We propose to first predict a set of mid-level concepts from low-level visual and auditory features, then fuse the concept predictions and features to detect violent content. We deliberately restrict ourselves to simple general-purpose features with limited temporal context and a generic neural network classifier. The system used this year is largely based on the one successfully employed by our group in the 2012 task, with some improvements based on our experience from last year. Our best-performing run with regard to the official metric received a MAP@100 of 49.6%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2013 Affect Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] challenged
participants to develop algorithms for finding violent scenes in
popular movies from DVD content based on video, audio
and subtitles. The organizers provided a training set of 18
movies with frame-wise annotations of segments containing
physical violence as well as several violence-related concepts
(e.g. blood or fire), and a test set of 7 unannotated movies.
      </p>
      <p>
        The system used by our group this year is largely based on
the one successfully employed by us in the 2012 edition of the
violent scenes detection task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This year we have tried
new descriptor combinations, and tweaked the neural
network training parameters based on experiments performed
with the 2012 task setup.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
      <p>
        Our system builds on a set of visual and auditory features,
employing the same type of classifier at different stages to
obtain a violence score for each frame of an input video. The
setup is largely the same as in 2012 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature set</title>
      <p>
        Visual (93 dimensions): For each video frame, we extract
an 81-dimensional Histogram of Oriented Gradients (HoG),
an 11-dimensional Color Naming Histogram [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and a
visual activity value. The latter is obtained by lowering the
threshold of the cut detector in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] such that it becomes
overly sensitive, then counting the number of detections in
a 2-second time window centered on the current frame.
      </p>
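      <p>For illustration, the following Python sketch computes such an activity value, assuming a plain frame-difference detector as a stand-in for the histogram-based cut detector of [3]; the threshold and function names are placeholders.</p>
      <preformat preformat-type="code">
# Sketch of the visual activity feature: an over-sensitive cut detector is
# run over the frame sequence and its detections are counted in a 2-second
# window centered on each frame. The frame-difference "detector" below only
# stands in for the detector of [3]; the threshold is illustrative.
import numpy as np

FPS = 25          # frame rate assumed for the videos
WINDOW = 2 * FPS  # 2-second window, in frames

def visual_activity(frames, threshold=10.0):
    """frames: array of shape (n_frames, height, width), grayscale values."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    # overly sensitive detector: flag every frame whose mean absolute
    # difference to its predecessor exceeds a low threshold
    detections = np.concatenate([[0.0], np.greater(diffs, threshold)])
    half = WINDOW // 2
    padded = np.pad(detections, (half, half))
    # count detections in the centered 2-second window around each frame
    return np.array([padded[i:i + WINDOW].sum() for i in range(len(detections))])
      </preformat>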
      <p>
        Auditory (98 dimensions): In addition, we extract a set
of low-level auditory features as used by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: Linear
Predictive Coefficients (LPCs), Line Spectral Pairs (LSPs),
Mel-Frequency Cepstral Coefficients (MFCCs), Zero-Crossing
Rate (ZCR), and spectral centroid, flux, rolloff, and kurtosis,
augmented with the variance of each feature over a
half-second time window. We use frame sizes of 40 ms without
overlap to make alignment with the 25-fps video frames
trivial.
      </p>
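      <p>A minimal sketch of such an extraction, using librosa as a stand-in toolchain and covering only a subset of the descriptors listed above, could look as follows.</p>
      <preformat preformat-type="code">
# Illustrative extraction of some of the listed auditory descriptors on
# 40 ms non-overlapping frames, so that one audio frame maps to one 25-fps
# video frame; librosa is an assumption, not necessarily the tool used here.
import numpy as np
import librosa

def auditory_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr, mono=True)
    frame = int(0.040 * sr)  # 40 ms frames, no overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame, hop_length=frame)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame,
                                             hop_length=frame)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr,
                                                 n_fft=frame, hop_length=frame)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr,
                                               n_fft=frame, hop_length=frame)
    feats = np.vstack([mfcc, zcr, centroid, rolloff]).T  # (n_frames, n_dims)
    # augment with the variance of each descriptor over a half-second window
    # (12 frames of 40 ms), here a simple trailing window
    half_sec = 12
    var = np.stack([feats[max(i - half_sec, 0):i + 1].var(axis=0)
                    for i in range(len(feats))])
    return np.hstack([feats, var])
      </preformat>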
    </sec>
    <sec id="sec-4">
      <title>2.2 Classifier</title>
      <p>For classification, we use multi-layer perceptrons with a
single hidden layer of 512 units and one or multiple output
units. All units use the logistic sigmoid transfer function.</p>
      <p>We normalize the input data by subtracting the mean and
dividing by the standard deviation of each input dimension.</p>
      <p>Training is performed by backpropagating cross-entropy
error, using random dropouts to improve generalization. We
follow the dropout scheme of [2, Sec. A.1] with minor
modifications: all weights are initialized to zero, mini-batches are
900 samples, the learning rate starts at 5.0, momentum is
increased from 0.45 to 0.9 between epochs 10 and 20, and we
train for 100 epochs only. These settings worked well in
experiments with the 2012 training/testing split. In particular,
we increased the learning rate from what was used in 2012,
because it improved performance.</p>
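      <p>For illustration, a minimal PyTorch sketch of this network and training schedule follows; the dropout rates and the absence of learning-rate decay are assumptions, and inputs are expected to be standardized as described above.</p>
      <preformat preformat-type="code">
# Sketch of the classifier and training schedule described above; the actual
# implementation may differ (e.g. dropout rates, learning-rate decay).
import torch
import torch.nn as nn

def make_net(n_inputs, n_outputs=1, dropout=0.5):
    net = nn.Sequential(
        nn.Dropout(dropout),            # dropout on the inputs (rate assumed)
        nn.Linear(n_inputs, 512),
        nn.Sigmoid(),                   # 512 logistic hidden units
        nn.Dropout(dropout),            # dropout on the hidden layer
        nn.Linear(512, n_outputs),
        nn.Sigmoid(),                   # logistic output unit(s)
    )
    for m in net.modules():
        if isinstance(m, nn.Linear):    # all weights initialized to zero,
            nn.init.zeros_(m.weight)    # as stated above
            nn.init.zeros_(m.bias)
    return net

def train(net, loader, epochs=100):
    """loader yields mini-batches of 900 standardized inputs and float targets."""
    loss_fn = nn.BCELoss()              # cross-entropy for sigmoid outputs
    opt = torch.optim.SGD(net.parameters(), lr=5.0, momentum=0.45)
    for epoch in range(epochs):
        # ramp momentum linearly from 0.45 to 0.9 between epochs 10 and 20
        ramp = min(max((epoch - 10) / 10.0, 0.0), 1.0)
        for group in opt.param_groups:
            group["momentum"] = 0.45 + ramp * (0.9 - 0.45)
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(net(x), y)
            loss.backward()
            opt.step()
    return net
      </preformat>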
    </sec>
    <sec id="sec-5">
      <title>2.3 Fusion scheme</title>
      <p>
        As last year [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we use the concept annotations as a
stepping stone for predicting violence: We train a separate
classifier for each of 10 different concepts on the visual, auditory,
or both feature sets, then train the final violence predictor
using both feature sets and all concept predictions as inputs.
For comparison, we also train classifiers to predict violence
just from the features or just from the concepts.
      </p>
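      <p>A simplified sketch of the two-stage scheme is given below, using scikit-learn's MLPClassifier as a stand-in for the network of Section 2.2; for brevity it feeds resubstitution concept scores to the final predictor, whereas the actual system uses the cross-validated outputs described in Section 3.2.</p>
      <preformat preformat-type="code">
# Two-stage fusion sketch: one classifier per concept, then a final violence
# classifier fed with the low-level features plus the concept scores.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_fusion(X, concept_labels, violence_labels):
    """X: (n_frames, n_features); concept_labels: dict of name to 0/1 labels."""
    concept_models, concept_scores = {}, []
    for name, y in concept_labels.items():
        clf = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
        clf.fit(X, y)
        concept_models[name] = clf
        concept_scores.append(clf.predict_proba(X)[:, 1])
    # final predictor: low-level features concatenated with concept scores;
    # the real system uses cross-validated concept outputs here (Sec. 3.2)
    X_fused = np.hstack([X, np.stack(concept_scores, axis=1)])
    violence = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
    violence.fit(X_fused, violence_labels)
    return concept_models, violence
      </preformat>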
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Concept prediction</title>
      <p>
        For the training set of 18 movies, each video frame was
annotated with the 10 different concepts as detailed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
We divide the concepts into visual, auditory and audiovisual
categories, depending on which low-level feature domains we
think are relevant for each. Next, we train and evaluate a
neural network for each of the concepts, employing
leave-one-movie-out cross-validation. The evaluation results are
very similar to our experiments in 2012 [4, Sec. 3.1], which
is not surprising since the training set has only been
supplemented with 20% new movies. For example, firearms and
fire perform well, while carchase performs badly.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Violence prediction</title>
      <p>Next, we train a frame-wise violence predictor, using
visual and auditory low-level features, as well as the concept
predictions, as input. Training requires inputs that are
similar to those that will be used in the testing phase, so using
the concept ground truth for training will not work. Instead,
we use the concept prediction cross-validation outputs on
the training set (see previous section) as a more realistic
input source; in this way the system can learn which concept
predictors to rely on.</p>
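      <p>For illustration, such out-of-fold concept scores can be obtained with leave-one-movie-out cross-validation, e.g. using scikit-learn; this is a sketch, not the exact implementation used here.</p>
      <preformat preformat-type="code">
# Each frame's concept score comes from a model that never saw that frame's
# movie, mirroring the situation at test time.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.neural_network import MLPClassifier

def oof_concept_scores(X, concept_labels, movie_ids):
    """X: (n_frames, n_features); movie_ids: movie index of each frame."""
    scores = []
    for name, y in concept_labels.items():
        clf = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
        proba = cross_val_predict(clf, X, y, groups=movie_ids,
                                  cv=LeaveOneGroupOut(), method="predict_proba")
        scores.append(proba[:, 1])
    # these out-of-fold scores replace the ground-truth concepts as inputs
    # when training the final violence classifier
    return np.stack(scores, axis=1)
      </preformat>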
    </sec>
    <sec id="sec-9">
      <title>3.3 Evaluation results</title>
      <p>We submitted five runs for subtask 1, i.e., the objective
violence definition. Due to time constraints we were not able
to prepare any runs for subtask 2, which used the subjective
violence definition. One of our runs was a segment-level run
(run5), which forms segments of consecutive frames that our
predictor tagged as violent or non-violent. The remaining
four runs are shot-level (from run1 to run4), which use the
shot boundaries provided by the task organizers. For each
run, each partition (segment or shot) is assigned a violence
score corresponding to the highest predictor output for any
frame within the segment. The segments are then tagged as
violent or non-violent depending on whether their violence
score exceeds a certain threshold. We used the same
thresholds as our system in 2012, which were determined
by cross-validation on the training set of that year.</p>
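      <p>A minimal sketch of this shot-level scoring follows; the shot boundaries are (start, end) frame indices and the threshold value is only a placeholder.</p>
      <preformat preformat-type="code">
# Shot-level scoring: each shot receives the highest frame-level violence
# score within its boundaries and is tagged violent if that score exceeds a
# fixed threshold (reused from the 2012 system; 0.5 is a placeholder).
import numpy as np

def score_shots(frame_scores, shot_boundaries, threshold=0.5):
    shot_scores = np.array([frame_scores[start:end].max()
                            for start, end in shot_boundaries])
    violent = np.greater(shot_scores, threshold)
    return shot_scores, violent
      </preformat>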
      <p>Table 1 details the results for all our runs. The first five
lines show our runs submitted to the official evaluation. The
first four are shot-level runs, the fifth our single
segment-level run. The next three lines are additional unofficial runs
that we evaluated ourselves. The second column indicates
which input features were used, 'a' for auditory, 'v' for
visual, and 'c' for concept predictions. The auditory features
achieved the highest MAP@100 result, with no gains being
provided by the other modalities.</p>
      <p>
        For our submissions we reused the thresholds from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Unfortunately, this gave a very imbalanced precision and recall
for the concept-only submission (run 2), making it difficult
to compare to our other runs. To better judge the relative
performance of our submissions, Table 1 reports precision,
recall and F-score for the threshold maximizing the F-score.
Under this metric, the combination of auditory features and
concept predictions gives the best result, but differences
between most runs are quite small.
      </p>
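      <p>For illustration, such an F-score-maximizing threshold can be obtained from the precision-recall curve, e.g. with scikit-learn; the sketch below is not necessarily the procedure used for Table 1.</p>
      <preformat preformat-type="code">
# Choose the decision threshold that maximizes the F-score on a set of
# frame- or shot-level violence scores.
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f_threshold(y_true, scores):
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the last point
    f = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    best = int(np.argmax(f))
    return thresholds[best], f[best]
      </preformat>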
      <p>Table 2 shows the movie-specific results for each of our
submitted shot-level runs. Despite the bad threshold on
run2, it performs very well on Pulp Fiction. The movie
"Legally Blond" had very few violent scenes, and these were
hard to detect with any of our runs.</p>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS</title>
      <p>Our results show that violence detection can be done fairly
well using general-purpose features and generic neural
network classifiers, without engineering domain-specific features.
While auditory features give the best results, using mid-level
concepts can give small overall gains, and more pronounced
gains for particular movies.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Penet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Quang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>The MediaEval 2013 Affect Task: Violent Scenes Detection</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Improving neural networks by preventing co-adaptation of feature detectors</article-title>
          . arXiv,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Buzuloiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lambert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Coquin</surname>
          </string-name>
          .
          <article-title>Improved Cut Detection for the Segmentation of Animation Movies</article-title>
          . In IEEE ICASSP, France,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , J. Schluter, I. Mironica, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          .
          <article-title>A naive mid-level concept-based fusion approach to violence detection in Hollywood movies</article-title>
          .
          <source>In Proceedings of the 3rd ACM conference on International conference on multimedia retrieval</source>
          ,
          <source>ICMR '13</source>
          , pages
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          .
          <article-title>Classification of music and speech in Mandarin news broadcasts</article-title>
          .
          <source>In Proc. of the 9th Nat. Conf. on Man-Machine Speech Communication (NCMMSC)</source>
          , Huangshan, Anhui, China,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>van de Weijer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Larlus</surname>
          </string-name>
          .
          <article-title>Learning color names for real-world applications</article-title>
          .
          <source>IEEE Trans. on Image Processing</source>
          ,
          <volume>18</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1512</fpage>
          -
          <lpage>1523</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>