<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RECOD at MediaEval 2014: Violent Scenes Detection Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandra Avila</string-name>
          <email>sandra@dca.fee.unicamp.br</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Moreira</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauricio Perez</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Moraes</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabela Cota</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanessa Testoni</string-name>
          <email>vanessa.t@samsung.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardo Valle</string-name>
          <email>dovalle@dca.fee.unicamp.br</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siome Goldenstein</string-name>
          <email>siome@ic.unicamp.br</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anderson Rocha</string-name>
          <email>anderson.rocha@ic.unicamp.br</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper presents the RECOD approaches used in the MediaEval 2014 Violent Scenes Detection task. Our system is based on the combination of visual, audio, and text features, and we also evaluate the performance of a convolutional network as a feature extractor. Those features are combined using a fusion scheme. We participated in both the main and the generalization tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The objective of the MediaEval 2014 Violent Scenes Detection task is to automatically detect violent scenes in movies and web videos. The targeted violent scenes are those "one would not let an 8 years old child see in a movie because they contain physical violence".</p>
      <p>
        This year, two different datasets were provided: (i) a set of 31 Hollywood movies, for the main task, and (ii) a set of 86 short YouTube web videos, for the generalization task. The training data is the same for both subtasks. A detailed overview of the datasets and the subtasks can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In the following, we briefly introduce our system and discuss our results. (Some technical aspects cannot be disclosed in this manuscript, because we are patenting the developed approach.)</p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Visual Features</title>
      <p>
        For low-level visual feature extraction, we extract SURF descriptors [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For that, we first apply the FFmpeg software [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to extract and resize the video frames. The low-level visual descriptors are extracted on a dense spatial grid at multiple scales, and then reduced with a PCA algorithm.
      </p>
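      <p>As a rough illustration of this stage, the sketch below extracts and resizes frames with FFmpeg, builds a dense multi-scale grid of keypoints, computes a local descriptor at each grid point, and reduces the descriptors with PCA. The use of OpenCV SIFT as the dense descriptor, the grid and scale parameters, the 64-dimensional PCA target, and the file names are illustrative assumptions rather than our exact configuration.</p>
      <preformat>
# A sketch of the low-level visual pipeline (illustrative parameters): frames
# are extracted and resized with FFmpeg, dense multi-scale local descriptors
# are computed per frame, and PCA reduces their dimensionality.
import glob
import subprocess

import cv2
import numpy as np
from sklearn.decomposition import PCA

def extract_frames(video_path, out_dir, fps=1, width=320):
    """Decode, temporally subsample, and resize frames with FFmpeg (out_dir must exist)."""
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"fps={fps},scale={width}:-1",
         f"{out_dir}/frame_%05d.png"],
        check=True)

def dense_keypoints(shape, step=8, scales=(16, 24, 32)):
    """Keypoints on a regular grid, repeated at several patch sizes."""
    h, w = shape[:2]
    return [cv2.KeyPoint(float(x), float(y), float(s))
            for s in scales
            for y in range(step, h - step, step)
            for x in range(step, w - step, step)]

def frame_descriptors(frame_path, extractor=None):
    """Compute one local descriptor per grid point (OpenCV SIFT used as a stand-in)."""
    img = cv2.imread(frame_path, cv2.IMREAD_GRAYSCALE)
    extractor = extractor or cv2.SIFT_create()
    _, desc = extractor.compute(img, dense_keypoints(img.shape))
    return desc

extract_frames("movie.mp4", "frames")
sample = np.vstack([frame_descriptors(p) for p in sorted(glob.glob("frames/*.png"))])
pca = PCA(n_components=64).fit(sample)      # dimensionality reduction step
reduced = pca.transform(sample)
</preformat>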
      <p>
        In addition, to incorporate temporal information, we compute dense trajectories and motion boundary descriptors, following [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Again, for the sake of processing time, we resize the videos, and we also reduce the dimensionality of the video descriptors.
      </p>
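      <p>A possible wrapper for this step is sketched below: the video is downscaled with FFmpeg, a dense-trajectories extractor is invoked as an external program, and the resulting descriptors are PCA-reduced. The binary name and its plain-text output layout are hypothetical placeholders; the code released with the dense-trajectories work should be consulted for the actual interface.</p>
      <preformat>
# Hypothetical wrapper around an external dense-trajectories extractor
# (the binary name "DenseTrackStab" and its output layout are assumptions).
import subprocess

import numpy as np
from sklearn.decomposition import PCA

def resize_video(src, dst, width=320):
    # Downscale first, purely to keep trajectory extraction tractable.
    subprocess.run(["ffmpeg", "-i", src, "-vf", f"scale={width}:-1", dst],
                   check=True)

def dense_trajectory_features(video):
    # Assume each output line is one trajectory: bookkeeping fields followed
    # by the concatenated trajectory-shape / HOG / HOF / MBH values.
    out = subprocess.run(["./DenseTrackStab", video],
                         capture_output=True, text=True, check=True)
    rows = [list(map(float, line.split())) for line in out.stdout.splitlines()]
    return np.array(rows)

resize_video("movie.mpeg", "movie_small.avi")
feats = dense_trajectory_features("movie_small.avi")
feats_reduced = PCA(n_components=64).fit_transform(feats)  # per-trajectory reduction
</preformat>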
      <p>In mid-level feature extraction, for each descriptor type, we use a bag-of-visual-words representation.</p>
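      <p>The sketch below shows a minimal bag-of-visual-words encoder of this kind: a k-means codebook is learned per descriptor type, and each video segment is represented by a normalized histogram of visual-word assignments. The vocabulary size and the use of hard assignment are assumptions for illustration.</p>
      <preformat>
# Minimal bag-of-visual-words encoding: one codebook per descriptor type,
# hard assignment, L1-normalized histogram per video segment. The vocabulary
# size (k=1000) and MiniBatchKMeans are illustrative choices.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def fit_codebook(descriptor_sample, k=1000):
    """Learn the visual vocabulary from a sample of (PCA-reduced) descriptors."""
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(descriptor_sample)

def bovw_histogram(codebook, descriptors):
    """Assign each descriptor to its nearest visual word and L1-normalize."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# codebook = fit_codebook(training_descriptor_sample)
# segment_vector = bovw_histogram(codebook, segment_descriptors)
</preformat>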
      <p>
        Furthermore, we use a visual feature extractor based on Convolutional Networks, trained on the ImageNet 2012 training set [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It was chosen due to its very competitive results on detection and classification tasks. Additionally, as far as we know, deep learning methods had not yet been employed in the MediaEval Violent Scenes Detection task.
      </p>
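      <p>A minimal sketch of using an ImageNet-pretrained network as a frame-level feature extractor follows. The specific architecture (a torchvision ResNet-50 here) and the choice of the penultimate layer are illustrative assumptions, since the network actually employed is not detailed in this paper.</p>
      <preformat>
# Illustrative frame-level feature extraction with an ImageNet-pretrained
# ConvNet (torchvision ResNet-50 as a stand-in; not necessarily the network
# used in our runs). The classifier head is dropped to expose 2048-d features.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()   # keep penultimate-layer activations
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def frame_feature(path):
    """Return one 2048-dimensional descriptor for a video frame image."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return model(x).squeeze(0).numpy()
</preformat>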
    </sec>
    <sec id="sec-4">
      <title>2.2 Audio Features</title>
      <p>
        Using the OpenSmile library [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we extract several types of audio features. The same bag-of-words scheme is employed to quantize the audio features, and a PCA algorithm is also used to reduce their dimensionality.
      </p>
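      <p>The following sketch illustrates this stage: openSMILE's command-line extractor is invoked on an audio track and the resulting frame-level features are loaded for quantization. The configuration file name, the CSV output layout, and the delimiter are assumptions made for the sake of the example.</p>
      <preformat>
# Sketch of audio feature extraction via the openSMILE command-line tool,
# followed by loading the features for codebook quantization. The config
# file name and the CSV layout (header row, ';' delimiter) are assumptions.
import subprocess

import numpy as np

def opensmile_features(wav_path, csv_path, config="audio_features.conf"):
    """Run SMILExtract and return one feature vector per analysis frame."""
    subprocess.run(["SMILExtract", "-C", config, "-I", wav_path, "-O", csv_path],
                   check=True)
    return np.loadtxt(csv_path, delimiter=";", skiprows=1)

audio_feats = opensmile_features("scene.wav", "scene_features.csv")
# audio_feats would then be PCA-reduced and quantized with a bag-of-words
# codebook, as done for the visual descriptors.
</preformat>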
    </sec>
    <sec id="sec-5">
      <title>2.3 Text Features</title>
      <p>To represent the movie subtitles, we apply the bag-of-words approach, one of the most common, simple, and successful document representations. The bag-of-words vector is normalized using the terms' document frequencies.</p>
      <p>Also, before creating the bag-of-words representation, we remove stop words and apply a stemming algorithm to reduce each word to its stem.</p>
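      <p>A compact sketch of such a subtitle pipeline is given below: tokens are lowercased, stop words are removed, the remaining words are stemmed, and the resulting term counts are reweighted by document frequency (here via scikit-learn's TF-IDF). The English stop list, the Porter stemmer, and the toy documents are illustrative assumptions.</p>
      <preformat>
# Subtitle bag-of-words sketch: lowercase, drop stop words, stem, then build
# term vectors reweighted by document frequency (scikit-learn TF-IDF here).
# The English stop list, Porter stemmer, and toy documents are assumptions.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def tokenize(text):
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
subtitle_docs = ["He pulls the trigger", "They sit and talk quietly"]  # toy data
bow = vectorizer.fit_transform(subtitle_docs)   # one row per subtitle segment
</preformat>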
    </sec>
    <sec id="sec-6">
      <title>2.4 Classification</title>
      <p>
        Classification is performed with Support Vector Machine (SVM) classifiers, using the LIBSVM library [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Classification is done separately for each descriptor. The outputs of those individual classifiers are then combined at the level of normalized scores, and our fusion strategy is a combination of the classification outcomes that is optimized on the training set. Scenes are classified as violent or non-violent based on a certain threshold.
      </p>
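      <p>The late-fusion scheme can be sketched as follows: one SVM is trained per descriptor type, its scores are normalized to a common range, and the normalized scores are combined with weights selected on the training data. The RBF kernel, the min-max normalization, and the random search over fusion weights are illustrative choices, not necessarily the ones we optimized.</p>
      <preformat>
# Late-fusion sketch: one SVM per descriptor type, min-max normalized scores,
# fusion weights picked by average precision on a held-out part of the
# training data. Kernel, normalization, and the random weight search are
# illustrative choices rather than our exact optimization.
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.svm import SVC

def normalized_scores(clf, X):
    s = clf.decision_function(X)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)   # map to [0, 1]

def fuse(score_list, weights):
    return sum(w * s for w, s in zip(weights, score_list))

def train_and_fuse(feature_sets, y_train, y_val, n_trials=200, seed=0):
    """feature_sets: dict mapping descriptor name to (X_train, X_val) matrices."""
    rng = np.random.default_rng(seed)
    scores = []
    for name, (X_tr, X_va) in feature_sets.items():
        clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_train)
        scores.append(normalized_scores(clf, X_va))
    # Random search over convex combinations, scored by validation MAP.
    best_w = max((rng.dirichlet(np.ones(len(scores))) for _ in range(n_trials)),
                 key=lambda w: average_precision_score(y_val, fuse(scores, w)))
    return best_w, fuse(scores, best_w)
</preformat>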
    </sec>
    <sec id="sec-7">
      <title>3. RUNS SUBMITTED</title>
      <p>In total, we generated 10 different runs: 5 runs for each subtask. For the main task (m), we have:</p>
      <list list-type="simple">
        <list-item><p>m1: 3 types of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks) + text features;</p></list-item>
        <list-item><p>m2: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks) + text features;</p></list-item>
        <list-item><p>m3: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);</p></list-item>
        <list-item><p>m4: 1 type of audio features + 2 types of visual features + text features;</p></list-item>
        <list-item><p>m5: 1 type of audio features.</p></list-item>
      </list>
      <p>For the generalization task (g), we have:</p>
      <list list-type="simple">
        <list-item><p>g1: 3 types of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);</p></list-item>
        <list-item><p>g2: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);</p></list-item>
        <list-item><p>g3: 1 type of audio features + 2 types of visual features;</p></list-item>
        <list-item><p>g4: 1 type of audio features;</p></list-item>
        <list-item><p>g5: 1 type of visual features.</p></list-item>
      </list>
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS AND DISCUSSION</title>
      <table-wrap id="tab-main">
        <caption>
          <p>Official results for the main task runs m1–m5: MAP on each of the seven test movies, followed by the overall MAP.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th colspan="7">Per-movie MAP</th><th>MAP</th></tr>
          </thead>
          <tbody>
            <tr><td>m1</td><td>0.204</td><td>0.477</td><td>0.337</td><td>0.567</td><td>0.188</td><td>0.479</td><td>0.378</td><td>0.376</td></tr>
            <tr><td>m2</td><td>0.239</td><td>0.459</td><td>0.308</td><td>0.348</td><td>0.362</td><td>0.465</td><td>0.431</td><td>0.373</td></tr>
            <tr><td>m3</td><td>0.189</td><td>0.545</td><td>0.277</td><td>0.465</td><td>0.212</td><td>0.418</td><td>0.489</td><td>0.371</td></tr>
            <tr><td>m4</td><td>0.115</td><td>0.319</td><td>0.209</td><td>0.270</td><td>0.159</td><td>0.502</td><td>0.167</td><td>0.249</td></tr>
            <tr><td>m5</td><td>0.373</td><td>0.301</td><td>0.307</td><td>0.423</td><td>0.175</td><td>0.308</td><td>0.317</td><td>0.315</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>For the main task, our results are considerably below our expectations (based on our training results). By analyzing the results, we identified a crucial difference between training and test videos. In the Violent Scenes Detection task, the participants are instructed on how to extract the DVD data and convert it to MPEG format. For the sake of saving disk space, we opted to convert the MPEG video files to MP4 or to M4V. However, that choice introduced a set of problems.</p>
      <p>First, with respect to the training data, we converted the MPEG video files to MP4 or to M4V, depending on which video container allowed us to successfully synchronize the extracted frames with the frame numbers given by the groundtruth. Although both containers store the video stream in H.264 format, we did not notice that the M4V conversion resulted in a different video aspect ratio (718 × 432 pixels). Similarly, the audio encoding also diverged: MP3 audio for MP4, and AAC audio for M4V. Next, due to the frame synchronization issue, we kept the test data in its original format (MPEG-2, 720 × 576 pixels, with AC3 audio). Therefore, we faced the problem of dealing with different aspect ratios in the training and test data, as well as distinct audio formats.</p>
      <p>For the generalization task, the problem is alleviated
because the test data is provided in MP4.</p>
      <p>Table 3 reports the unofficial (u) results for the main task, which we evaluated ourselves. Here, the results were obtained using the data (training and test sets) in MPEG format. The first column indicates which input features were used: u1 for 1 type of audio features and u2 for text features. Unfortunately, due to time constraints, we were not able to prepare more runs.</p>
      <p>It should first be mentioned that the results for run u2 are independent of the video format, since we extracted the movie subtitles directly from the DVDs. For run u1, we can notice a considerable improvement of classification performance, from 0.315 (run m5) to 0.493 (run u1), confirming the negative impact of using distinct audio formats. We are currently investigating the impact on the visual features.</p>
      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption>
          <p>Unofficial results for the main task, with training and test data in MPEG format: MAP on each of the seven test movies (the first being 8 Mile), followed by the overall MAP, for runs u1 and u2.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th colspan="7">Per-movie MAP</th><th>MAP</th></tr>
          </thead>
          <tbody>
            <tr><td>u1</td><td>0.351</td><td>0.601</td><td>0.636</td><td>0.530</td><td>0.521</td><td>0.352</td><td>0.463</td><td>0.493</td></tr>
            <tr><td>u2</td><td>0.402</td><td>0.237</td><td>0.407</td><td>0.345</td><td>0.232</td><td>0.277</td><td>0.188</td><td>0.298</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-9">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This research was partially supported by FAPESP, CAPES, CNPq, and the project "Capacitação em Tecnologia de Informação", financed by Samsung Eletrônica da Amazônia Ltda., using resources provided by the Informatics Law no. 8.248/91.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] FFmpeg. http://www.ffmpeg.org/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):1–27, 2011.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] F. Eyben, M. Wöllmer, and B. Schuller. openSMILE: the Munich versatile and fast open-source audio feature extractor. In ACM Multimedia, pages 1459–1462, 2010.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60:91–110, 2004.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, and C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16–17, 2014.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (IJCV), 103:60–79, 2013.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>