<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MTM at MediaEval 2013 Violent Scenes Detection: through acoustic-visual transform</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bruno do Nascimento Teixeira</string-name>
          <email>bruno.texeira@dcc.ufmg.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade Federal de Minas Gerais, Belo Horizonte</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper describes the participation of team MTM in the MediaEval 2013 campaign. We submitted one shot-level run that exploits the spatial correlation between acoustic and visual features. Motion features represent the video, while the Mel Frequency Cepstral Coefficients (MFCCs) of the acoustic signal, together with their first- and second-order derivatives, represent the audio. The main assumption behind our movie shot classifier is that velocity and acceleration are correlated with the acoustic features. Our approach relies on finding canonical bases with Canonical Correlation Analysis (CCA) in order to represent the video. We also add spatial information by using frame regions. We evaluate the performance of the proposed method on the MediaEval 2013 Violent Scenes Detection film data. Keywords: MediaEval 2013, violent scenes detection, canonical correlation analysis, Bayesian networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Two classes of violence can be considered: objective and
subjective violence. Objective violence is defined as "physical violence
or accident resulting in human injury or pain". More details about
the violence detection task can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        We focus on objective violence and assume that there is a non-trivial
correlation between acoustic features and motion in scenes of objective
violence. We therefore explore the correlation between acoustic
and visual features. Canonical Correlation Analysis (CCA),
proposed by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], maximizes the correlation between two multivariate
random vectors by finding linear transforms <inline-formula><tex-math>w_x</tex-math></inline-formula> and <inline-formula><tex-math>w_y</tex-math></inline-formula>. CCA has been
employed for the identification and segmentation of moving-sounding
objects [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHOD</title>
      <p>The goal of the proposed method is to combine visual and
acoustic features by computing canonical basis vectors. A grid (see
Figure 1 (a)) segments each frame in order to capture spatial
information and to build an acoustic transform map.</p>
    </sec>
    <sec id="sec-3">
      <title>Video Representation</title>
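      <p>For each grid segment, the sample <inline-formula><tex-math>S_x</tex-math></inline-formula> of visual vectors is computed using optical flow, where
<inline-formula><tex-math>x_j = (x_j^1, x_j^2)</tex-math></inline-formula> and <inline-formula><tex-math>x_j^1</tex-math></inline-formula>, <inline-formula><tex-math>x_j^2</tex-math></inline-formula> are the average velocity and acceleration
magnitudes, respectively, over all pixels belonging to the same region. For
each audio frame, 12 MFCCs and their first and second derivatives
are computed to build a 36-dimensional acoustic vector <inline-formula><tex-math>y_j = (y_j^1, y_j^2, \ldots, y_j^{36})</tex-math></inline-formula>.</p>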
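      <p>The two feature vectors described above can be sketched as follows. This is a minimal illustration rather than the implementation used for the submitted run: the 5 x 5 grid layout, the use of the frame-to-frame change in optical flow as the acceleration, and the finite-difference derivatives are all assumptions, and the function names <monospace>region_motion_features</monospace> and <monospace>stack_with_deltas</monospace> are hypothetical.</p>
      <preformat>
```python
import numpy as np

GRID_ROWS, GRID_COLS = 5, 5  # assumed 5 x 5 layout, giving 25 regions


def region_motion_features(flow_prev, flow_curr, rows=GRID_ROWS, cols=GRID_COLS):
    """Return a (rows * cols, 2) array of [mean speed, mean acceleration magnitude].

    flow_prev, flow_curr: dense optical-flow fields of shape (H, W, 2) for two
    consecutive frame pairs (hypothetical inputs; any flow estimator would do).
    """
    speed = np.linalg.norm(flow_curr, axis=2)              # velocity magnitude per pixel
    accel = np.linalg.norm(flow_curr - flow_prev, axis=2)  # assumed: change in flow
    height, width = speed.shape
    row_edges = np.linspace(0, height, rows + 1).astype(int)
    col_edges = np.linspace(0, width, cols + 1).astype(int)
    features = np.zeros((rows * cols, 2))
    for r in range(rows):
        for c in range(cols):
            cell = (slice(row_edges[r], row_edges[r + 1]),
                    slice(col_edges[c], col_edges[c + 1]))
            # Average both magnitudes over all pixels of the grid cell.
            features[r * cols + c] = speed[cell].mean(), accel[cell].mean()
    return features


def stack_with_deltas(mfcc):
    """Append first and second temporal differences to a (frames, 12) MFCC array."""
    delta = np.gradient(mfcc, axis=0)    # first derivative along time
    delta2 = np.gradient(delta, axis=0)  # second derivative along time
    return np.hstack([mfcc, delta, delta2])  # shape (frames, 36)
```
      </preformat>
      <p>Each row of the first result corresponds to one grid region, and each row of the second to one 36-dimensional acoustic vector (12 coefficients plus two sets of 12 derivatives).</p>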
    </sec>
    <sec id="sec-4">
      <title>Canonical Correlation Analysis</title>
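      <p>Consider a multivariate random vector of the form <inline-formula><tex-math>(x, y)</tex-math></inline-formula> and
a sample of instances <inline-formula><tex-math>S = ((x_1, y_1), \ldots, (x_n, y_n))</tex-math></inline-formula> of <inline-formula><tex-math>(x, y)</tex-math></inline-formula>. We
can project <inline-formula><tex-math>x</tex-math></inline-formula> and <inline-formula><tex-math>y</tex-math></inline-formula> onto directions <inline-formula><tex-math>w_x</tex-math></inline-formula> and <inline-formula><tex-math>w_y</tex-math></inline-formula> (<inline-formula><tex-math>x \mapsto \langle w_x, x \rangle</tex-math></inline-formula>, <inline-formula><tex-math>y \mapsto \langle w_y, y \rangle</tex-math></inline-formula>)
so as to maximize the correlation between <inline-formula><tex-math>S_x w_x = (\langle w_x, x_1 \rangle, \ldots, \langle w_x, x_n \rangle)</tex-math></inline-formula>
and <inline-formula><tex-math>S_y w_y = (\langle w_y, y_1 \rangle, \ldots, \langle w_y, y_n \rangle)</tex-math></inline-formula>:
        <disp-formula>
          <label>(1)</label>
          <tex-math>\rho = \max_{w_x, w_y} \mathrm{corr}(S_x w_x, S_y w_y) = \max_{w_x, w_y} \frac{\langle S_x w_x, S_y w_y \rangle}{\|S_x w_x\| \, \|S_y w_y\|}.</tex-math>
        </disp-formula>
      </p>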
      <p>
        The coordinate system that optimizes the correlation between
corresponding coordinates is found by solving the generalized
eigenproblem <inline-formula><tex-math>Ax = \lambda Bx</tex-math></inline-formula> [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
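      <p>A minimal NumPy sketch of the leading canonical pair is given below. It assumes the standard formulation in which the generalized eigenproblem is built from the sample covariance blocks of the two views and reduced to an ordinary eigenproblem on the visual side. The ridge term <monospace>reg</monospace> and the function name <monospace>cca_first_pair</monospace> are illustrative assumptions, not part of the described system.</p>
      <preformat>
```python
import numpy as np


def cca_first_pair(Sx, Sy, reg=1e-6):
    """Return (w_x, w_y, rho) for the leading canonical pair.

    Sx: (n, p) visual samples, Sy: (n, q) acoustic samples, rows aligned in time.
    reg is a small ridge term (an assumption, added for numerical stability).
    """
    Sx = Sx - Sx.mean(axis=0)  # center both views
    Sy = Sy - Sy.mean(axis=0)
    n = Sx.shape[0]
    Cxx = Sx.T @ Sx / n + reg * np.eye(Sx.shape[1])
    Cyy = Sy.T @ Sy / n + reg * np.eye(Sy.shape[1])
    Cxy = Sx.T @ Sy / n
    # Reduced form of the generalized eigenproblem:
    #   Cxx^-1 Cxy Cyy^-1 Cyx w_x = rho^2 w_x
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, eigvecs = np.linalg.eig(M)
    top = np.argmax(eigvals.real)             # largest squared correlation
    w_x = eigvecs[:, top].real
    w_y = np.linalg.solve(Cyy, Cxy.T @ w_x)   # w_y is proportional to Cyy^-1 Cyx w_x
    rho = np.corrcoef(Sx @ w_x, Sy @ w_y)[0, 1]
    return w_x, w_y, rho
```
      </preformat>
      <p>The returned directions are defined only up to scale and sign, so any later use of <monospace>w_x</monospace> as a feature has to fix a normalization convention.</p>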
    </sec>
    <sec id="sec-5">
      <title>Early Fusion</title>
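      <p>Early fusion is performed by computing a joint visual and acoustic
representation <inline-formula><tex-math>T</tex-math></inline-formula> based on the canonical basis <inline-formula><tex-math>w_x</tex-math></inline-formula> of each region of the
frame, where <inline-formula><tex-math>S_x</tex-math></inline-formula> and <inline-formula><tex-math>S_y</tex-math></inline-formula> are used to maximize the correlation:
        <disp-formula>
          <label>(2)</label>
          <tex-math>T = [w_{x_1}, w_{x_2}, \ldots, w_{x_{25}}],</tex-math>
        </disp-formula>
where <inline-formula><tex-math>T</tex-math></inline-formula> is the resulting feature vector, <inline-formula><tex-math>w_{x_r}</tex-math></inline-formula> is the visual linear
transformation of the <inline-formula><tex-math>r</tex-math></inline-formula>-th region, and <inline-formula><tex-math>r</tex-math></inline-formula> indexes the grid position.</p>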
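      <p>Assembling the descriptor from the per-region bases can be sketched as below. The unit-length scaling and the sign convention are assumptions introduced here, since canonical directions are only defined up to scale and sign and the normalization is not specified. The function name <monospace>build_transform_map</monospace> is hypothetical.</p>
      <preformat>
```python
import numpy as np


def build_transform_map(region_bases):
    """Concatenate per-region visual canonical vectors into one descriptor T.

    region_bases: sequence of 1-D arrays, one w_x per grid region.
    Each vector is scaled to unit length and flipped so that its
    largest-magnitude component is positive (assumed convention).
    """
    parts = []
    for w in region_bases:
        w = np.asarray(w, dtype=float)
        norm = np.linalg.norm(w)
        if norm > 0:
            w = w / norm
        pivot = np.argmax(np.abs(w))
        if np.sign(w[pivot]) == -1:
            w = -w  # resolve the sign ambiguity of the canonical direction
        parts.append(w)
    return np.concatenate(parts)
```
      </preformat>
      <p>With 25 regions and a two-dimensional visual vector per region, this yields a 50-dimensional descriptor per shot.</p>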
    </sec>
    <sec id="sec-6">
      <title>Bayesian Network</title>
      <p>
        The Bayes Net Toolbox for Matlab (BNT) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is used to train a
network with two nodes: a class node <inline-formula><tex-math>C</tex-math></inline-formula> (violence or non-violence)
and an observed node <inline-formula><tex-math>T = [w_{x_1}, w_{x_2}, \ldots, w_{x_{25}}]</tex-math></inline-formula> (see
Figure 1 (b)).
      </p>
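      <p>The two-node structure can be illustrated in Python as follows. This is not the BNT code used for the run. It is a sketch under the assumption that the observed node is Gaussian with a diagonal covariance given the class, a choice that is not stated above. The class name <monospace>TwoNodeGaussianNet</monospace> and the variance floor are hypothetical.</p>
      <preformat>
```python
import numpy as np


class TwoNodeGaussianNet:
    """Network C -> T with a discrete class node and a Gaussian observed node.

    Assumption: T given C is Gaussian with a diagonal covariance.
    """

    def fit(self, T, labels, var_floor=1e-6):
        T = np.asarray(T, dtype=float)
        labels = np.asarray(labels)
        self.classes_ = np.unique(labels)
        # Maximum-likelihood estimates of the prior P(C) and of P(T | C).
        self.log_prior_ = np.log([np.mean(labels == c) for c in self.classes_])
        self.mean_ = np.array([T[labels == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([T[labels == c].var(axis=0) for c in self.classes_]) + var_floor
        return self

    def posterior(self, T):
        """Return P(C | T) for each row of T, columns ordered as self.classes_."""
        T = np.asarray(T, dtype=float)[:, None, :]  # shape (n, 1, d)
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)
                          + (T - self.mean_) ** 2 / self.var_).sum(axis=2)
        log_post = log_lik + self.log_prior_
        log_post -= log_post.max(axis=1, keepdims=True)  # stabilize before exp
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)
```
      </preformat>
      <p>The posterior of the violence class can then serve as a per-shot confidence score.</p>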
    </sec>
    <sec id="sec-7">
      <title>RESULTS</title>
      <p>Table 1 shows the global result of our approach on the
MediaEval 2013 affect task test set. For each film of the test set, we obtain
the following precision values, which range from 0.009 to 0.216:
Fantastic Four 0.216, Fargo 0.185, Forrest Gump 0.184, Legally
Blonde 0.009, Pulp Fiction 0.101, The Godfather 0.070, and The
Pianist 0.124 (see Table 2). We also plotted a false alarm (FA) curve
(see Figure 2) for classification analysis.</p>
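      <p>For readers who want to reproduce this kind of analysis, the sketch below computes precision, true positive rate, and false alarm rate from binary shot-level decisions. These are the usual textbook definitions and are given only as an assumption; the exact evaluation protocol behind Table 2 and Figure 2 is not restated here. The function name <monospace>shot_level_rates</monospace> is hypothetical.</p>
      <preformat>
```python
def shot_level_rates(predicted, actual):
    """Return (precision, true positive rate, false alarm rate) for binary shot labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # violent, detected
    fp = sum(1 for p, a in pairs if p and not a)      # non-violent, flagged
    fn = sum(1 for p, a in pairs if not p and a)      # violent, missed
    tn = sum(1 for p, a in pairs if not p and not a)  # non-violent, passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    true_positive_rate = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    return precision, true_positive_rate, false_alarm_rate
```
      </preformat>
      <p>For example, with predictions [1, 1, 1, 0, 0, 0, 1, 0] and ground truth [1, 0, 0, 0, 0, 1, 1, 0], the precision is 0.5 and the false alarm rate is 0.4.</p>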
    </sec>
    <sec id="sec-8">
      <title>CONCLUSIONS</title>
      <p>We have developed a method based on canonical basis vectors
to represent the video. Our method uses an acoustic-visual transform
and a spatial grid. It builds a transform map, used as the representation,
under the assumption that motion features from violent scenes
are correlated with acoustic features.</p>
      <p>Analyzing the result for each film separately, the high true positive and false
alarm rates show that the transform map alone cannot
distinguish all violent from non-violent shots or generalize the
concept of violence. This stems from the variability of the types of violence in
movies and from uncorrelated grid segments, which are not audio sources
and should be discarded from the map. Events such as explosions, screams,
and gunshots should present an acoustic-visual pattern and are located
in a specific frame region. In many scenes, only a few grid segments
contribute to the audio dynamics.</p>
      <p>Possible directions for future work include region filtering to detect
audio sources and remove noisy segments, spatial-temporal
segmentation and feature selection.
      </p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported in part by two grants from CAPES and
CNPq.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Penet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>The MediaEval 2013 affect task: Violent scenes detection</article-title>
          . In MediaEval Workshop,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Hardoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Szedmak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          .
          <article-title>Canonical correlation analysis: An overview with application to learning methods</article-title>
          .
          <source>Neural Comput.</source>
          ,
          <volume>16</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2639</fpage>
          -
          <lpage>2664</lpage>
          , Dec.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hotelling</surname>
          </string-name>
          .
          <article-title>Relations Between Two Sets of Variates</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>28</volume>
          (
          <issue>3/4</issue>
          ):
          <fpage>321</fpage>
          -
          <lpage>377</lpage>
          ,
          <year>1936</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Izadinia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Saleemi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Multimodal analysis for identification and segmentation of moving-sounding objects</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>15</volume>
          (
          <issue>2</issue>
          ):
          <fpage>378</fpage>
          -
          <lpage>390</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          .
          <article-title>Dynamic Bayesian Networks: Representation, Inference and Learning</article-title>
          .
          <source>PhD thesis</source>
          , UC Berkeley, Computer Science Division, July
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>