<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIG at MediaEval 2013 Affect Task: Use of a Generic Method and Joint Audio-Visual Words</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadia Derbas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bahjat Safadi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georges Quénot</string-name>
        </contrib>
        <aff>UJF-Grenoble / UPMF-Grenoble / Grenoble INP / CNRS, LIG UMR, Grenoble, France <email>FirstName.LastName@imag.fr</email></aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper describes the LIG participation in the MediaEval 2013 Affect Task on violent scenes detection in Hollywood movies. We submitted four runs at the shot level for each subtask: objective violent scenes detection and subjective violent scenes detection. Our four runs are: hierarchical fusion of descriptors and classifier combinations, the same with joint audio-visual words, and the same two with re-ranking. With the official MAP@100 metric, our reference run obtained 69% for subjective violence and 52% for objective violence. The joint audio-visual words bring a slight improvement in MAP@100 and improve the precision at the head of the returned list, while the temporal re-ranking improves the P@100.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2013 Affect Task: Violent Scenes
Detection is fully described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It derives directly from a
Technicolor use case which aims at easing a user's selection
process from a movie database. This task therefore applies to
movie content. This year, two different subtasks were
proposed for violent segments, corresponding to objective
violence and subjective violence. An objective violent scene is
defined as "physical violence or accident resulting in human
injury or pain"; a subjective one is a scene "one would not
let an 8 years old child see in a movie because it contains
physical violence".
      </p>
      <p>
        Our motivation is to test the performance of a new
descriptor based on joint audio-visual bi-modal codewords for
violent scenes detection. We also aim to see how
a generic system for general concept classification in video
shots performs compared to systems specifically
designed for the task, such as [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our system is a refined version
of last year's system, which was roughly a four-stage pipeline:
descriptor extraction, descriptor optimization, classification
and hierarchical late fusion. Besides using more descriptors,
we propose a new multi-modal feature, the "audio-visual
words". Most of the stages have been optimized and
specifically tuned on MediaEval development data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Descriptor Extraction</title>
      <list list-type="bullet">
        <list-item>
          <p>color: a 4 × 4 × 4 RGB color histogram (64-dim);</p>
        </list-item>
        <list-item>
          <p>texture: a 5-scale × 8-orientation Gabor transform (40-dim);</p>
        </list-item>
        <list-item>
          <p>
            SIFT: bag of SIFT descriptors computed using Koen
van de Sande's software [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], 1000-bin histograms; four
variants were used: Harris-Laplace filtering or dense
sampling with both hard and fuzzy clustering;
          </p>
        </list-item>
        <list-item>
          <p>audio: bag of MFCCs, 4096-bin histograms;</p>
        </list-item>
        <list-item>
          <p>STIP: bag of HOFs, 4096-bin histograms;</p>
        </list-item>
        <list-item>
          <p>joint audio-visual BoW: bag of MFCCs and HOFs, 32768-bin histograms (see section 2.6).</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-4">
      <title>2.2 Descriptor Optimization</title>
      <p>
        Descriptors were optimized with a method that
combines a PCA-based dimensionality reduction with a power
transformation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The power transformation normalizes
the distributions of the values, especially in the case of
histogram components. A PCA is then performed to reduce
the size (number of dimensions) of the descriptors while
improving performance by removing noisy components.
      </p>
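      <p>As an illustration only, the power transformation step can be sketched as follows. The exponent alpha = 0.5 and the final L2 normalization are assumptions made for this sketch, not settings reported in this paper, and the subsequent PCA step is left out.</p>
      <preformat><![CDATA[
```python
import math


def power_transform(histogram, alpha=0.5):
    """Apply x -> x**alpha to every component of a non-negative descriptor.

    alpha is a hypothetical value chosen for illustration; an exponent below 1
    compresses large bin counts and spreads out small ones.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if any(value < 0 for value in histogram):
        raise ValueError("histogram components must be non-negative")
    return [value ** alpha for value in histogram]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (assumed post-processing)."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm > 0 else list(vector)


# Toy 4-bin histogram dominated by a single bin.
raw = [0.0, 1.0, 4.0, 100.0]
transformed = power_transform(raw, alpha=0.5)
descriptor = l2_normalize(transformed)
print(transformed)
print(descriptor)
```
]]></preformat>
      <p>In this toy case the largest component is 100 times the smallest non-zero one before the transformation and 10 times after it, which is the flattening effect on histogram components mentioned above.</p>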
    </sec>
    <sec id="sec-5">
      <title>2.3 Classification</title>
      <p>Classification was done using two different learning
methods, one based on multiple SVMs for a better handling of
the class imbalance problem and one based on the k nearest
neighbors.</p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Fusion</title>
      <p>Classification was done separately for each classifier and
each descriptor variant. The outputs of these individual
classifiers were then merged at the level of normalized scores
(late fusion). A linear combination of the scores is used, with
weights optimized on the MediaEval development set.</p>
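      <p>A minimal sketch of such a late fusion is given below. It shows a single, flat fusion level rather than the full hierarchy, and the min-max normalization, the weight values and all function names are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def min_max_normalize(scores):
    """Map raw classifier scores to [0, 1] (assumed normalization scheme)."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(score - low) / (high - low) for score in scores]


def late_fusion(score_lists, weights):
    """Weighted linear combination of normalized per-classifier scores.

    score_lists[c][i] is the raw score of classifier c for shot i.
    weights[c] is the weight of classifier c; in practice it would be tuned
    on development data, here it is simply given.
    """
    if len(score_lists) != len(weights):
        raise ValueError("one weight per classifier is required")
    n_shots = len(score_lists[0])
    if any(len(scores) != n_shots for scores in score_lists):
        raise ValueError("all classifiers must score the same shots")
    normalized = [min_max_normalize(scores) for scores in score_lists]
    total = sum(weights)
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized)) / total
        for i in range(n_shots)
    ]


# Hypothetical outputs of two classifiers for three shots.
svm_scores = [-1.0, 0.0, 1.0]
knn_scores = [0.2, 0.8, 0.5]
fused = late_fusion([svm_scores, knn_scores], weights=[0.75, 0.25])
print(fused)
```
]]></preformat>
      <p>A hierarchical variant would apply the same combination repeatedly, for instance first within groups of similar descriptors and then across groups.</p>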
    </sec>
    <sec id="sec-7">
      <title>2.5 Temporal Re-ranking</title>
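      <p>
        As in our participation in MediaEval 2012 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we used a
temporal re-ranking method. The method is based on the
assumption that violence is more (or less) likely for
a given shot if it appears within a movie with a high (or
low) frequency of violent shots and/or if there are more (or
fewer) violent shots in its temporal neighborhood. We
proposed to exploit this at either a global or a local level
by computing a detection score at either the video or the
neighborhood level and then re-evaluating the score of each
shot according to this global or local score. The first step is
done by a kind of temporal smoothing and the second one
by a kind of averaging [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Performance of the LIG system variants on the objective and subjective subtasks; the last column is the mean of the two MAP@100 values.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run</th>
              <th>Objective MAP</th>
              <th>Objective MAP@100</th>
              <th>Objective P@100</th>
              <th>Subjective MAP</th>
              <th>Subjective MAP@100</th>
              <th>Subjective P@100</th>
              <th>Mean MAP@100</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>LIG-hierarchicalFusion</td>
              <td>0.501</td>
              <td>0.514</td>
              <td>0.381</td>
              <td>0.673</td>
              <td>0.690</td>
              <td>0.584</td>
              <td>0.602</td>
            </tr>
            <tr>
              <td>LIG-hierarchicalFusionJoint</td>
              <td>0.505</td>
              <td>0.520</td>
              <td>0.398</td>
              <td>0.669</td>
              <td>0.690</td>
              <td>0.602</td>
              <td>0.605</td>
            </tr>
            <tr>
              <td>LIG-hierarchicalFusionReranking</td>
              <td>0.443</td>
              <td>0.502</td>
              <td>0.412</td>
              <td>0.627</td>
              <td>0.685</td>
              <td>0.624</td>
              <td>0.593</td>
            </tr>
            <tr>
              <td>LIG-hierarchicalFusionJointReranking</td>
              <td>0.453</td>
              <td>0.512</td>
              <td>0.418</td>
              <td>0.627</td>
              <td>0.686</td>
              <td>0.635</td>
              <td>0.599</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>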
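      <p>The re-ranking idea described in this section can be illustrated with the sketch below. It is not the exact formulation of the cited re-ranking method: the moving-average neighborhood score, the linear mixing with a weight gamma, the window size and the function names are all assumptions chosen to keep the example short.</p>
      <preformat><![CDATA[
```python
def local_rerank(scores, half_window=2, gamma=0.5):
    """Mix each shot score with the mean score of its temporal neighborhood.

    scores are the detection scores of one movie's shots in temporal order.
    half_window and gamma are hypothetical parameters; the neighborhood
    includes the shot itself.
    """
    reranked = []
    for i, score in enumerate(scores):
        start = max(0, i - half_window)
        stop = min(len(scores), i + half_window + 1)
        neighborhood = scores[start:stop]
        local_score = sum(neighborhood) / len(neighborhood)
        reranked.append((1.0 - gamma) * score + gamma * local_score)
    return reranked


def global_rerank(scores, gamma=0.5):
    """Mix each shot score with a single score computed for the whole movie."""
    if not scores:
        return []
    video_score = sum(scores) / len(scores)
    return [(1.0 - gamma) * score + gamma * video_score for score in scores]


# A low-scoring shot surrounded by high-scoring ones is pulled upwards.
shot_scores = [0.9, 0.8, 0.1, 0.85, 0.9]
print(local_rerank(shot_scores, half_window=1, gamma=0.5))
print(global_rerank(shot_scores, gamma=0.5))
```
]]></preformat>
      <p>With gamma = 0 both functions return the original scores, so gamma controls how strongly the temporal context is trusted.</p>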
    </sec>
    <sec id="sec-8">
      <title>2.6 Audio-Visual Representation</title>
      <p>
        This year, we proposed a joint audio-visual
representation in order to capture the dependence between
the audio and the visual information based on their
simultaneous occurrence throughout the movies for a given
concept, an idea inspired by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this approach the video
content is modeled using the joint relations between the
audio (MFCC) and the visual (HOF) modalities. The audio
and visual features are first extracted from the movies
separately. These two features are then normalized and joined
by concatenating both feature vectors for each shot. Finally,
the bi-modal descriptors are quantized into bi-modal words
using a standard k-means clustering method, producing the
"joint audio-visual" bag-of-words representation.
      </p>
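      <p>The construction just described (normalize, concatenate, quantize with k-means, count words) can be sketched as below. The L2 normalization, the deterministic initialization from the first k points, the two-dimensional toy features and the two-word vocabulary are assumptions made for this sketch; the actual vocabulary has 32768 words.</p>
      <preformat><![CDATA[
```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumed per-modality normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def join_modalities(audio_vec, visual_vec):
    """Concatenate the normalized audio and visual feature vectors."""
    return l2_normalize(audio_vec) + l2_normalize(visual_vec)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, initialized with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for idx, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[idx] = [sum(col) / len(members)
                                  for col in zip(*members)]
    return centroids


def bag_of_words(points, centroids):
    """Histogram of nearest-word assignments over a set of joint descriptors."""
    histogram = [0] * len(centroids)
    for p in points:
        histogram[nearest(p, centroids)] += 1
    return histogram


# Toy stand-ins for MFCC (audio) and HOF (visual) features.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
joint = [join_modalities(a, v) for a, v in zip(audio, visual)]
vocabulary = kmeans(joint, k=2)
print(bag_of_words(joint, vocabulary))
```
]]></preformat>
      <p>Because each word is learned in the concatenated space, a word represents a co-occurring audio and visual pattern rather than either modality alone.</p>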
    </sec>
    <sec id="sec-9">
      <title>3. EXPERIMENTAL RESULTS</title>
      <p>We submitted four runs at the shot level for both the
objective and subjective definitions: the hierarchical
fusion of descriptors and classifier combinations, and the same
with the joint audio-visual words and/or with temporal
re-ranking. The hierarchical fusion run is our baseline and the
other three are contrastive ones. Table 1 shows the
performance of the LIG system variants using different
metrics: the Mean Average Precision (MAP), the Precision at
100 (P@100) and the official MediaEval metric for this task,
which is the MAP@100.</p>
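      <p>For readers unfamiliar with these metrics, a small sketch follows. It assumes binary relevance labels listed in ranked order and normalizes the average precision by the number of relevant shots found in the top k; the official evaluation tool may normalize differently, so this illustrates the definitions and is not a reimplementation of that tool.</p>
      <preformat><![CDATA[
```python
def precision_at_k(ranked_labels, k=100):
    """Fraction of the top-k returned shots that are relevant (label 1)."""
    return sum(ranked_labels[:k]) / k


def average_precision_at_k(ranked_labels, k=100):
    """Mean of the precision values at each relevant rank within the top k.

    Normalizing by the number of relevant shots in the top k is an
    assumption of this sketch.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(rankings, k=100):
    """Average of AP@k over several ranked lists, for example one per movie."""
    return sum(average_precision_at_k(r, k) for r in rankings) / len(rankings)


# Hypothetical ranked list: 1 = violent shot, 0 = non-violent shot.
labels = [1, 0, 1, 1, 0]
print(precision_at_k(labels, k=5))
print(average_precision_at_k(labels, k=5))
```
]]></preformat>
      <p>P@k ignores the order inside the top k, whereas AP@k rewards placing relevant shots early, which is why the two metrics can move in opposite directions.</p>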
      <p>Considering these metrics, our system produces quite good
results in the detection of objective and subjective
violent scenes in movies, with an average MAP@100 of about
60%. In general, our system provides better results for
the subjective definition, with a MAP@100 of about 69%,
than for the objective definition, with about 52%. This could
be related to the fact that the subjective definition is closer
to "basic violence" than the objective one. We
observe that the hierarchical fusion with the joint
audio-visual descriptor always improves the performance in terms
of MAP@100 and especially in terms of P@100 (even if the
improvement is slight in some cases). That is due to the
fact that the bi-modal words consider the relation between
audio and visual information while the other methods fuse
them without exploiting their mutual dependence. Further,
we notice that the re-ranking improves results only in terms
of P@100; it slightly lowers the MAP@100 and lowers
the overall MAP even more.</p>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS AND FUTURE WORK</title>
      <p>with different descriptors. This system includes a
hierarchical fusion of classifiers' outputs using two different
classification methods and a number of shot content descriptors.
Two new descriptors were added this year: the
classical motion descriptor (STIP-HOF) and our proposed joint
audio-visual descriptor. Four variants of the system were
evaluated, in which the joint audio-visual descriptor and the
temporal re-ranking were added or not to the baseline. Our
system obtained good results, with a MAP@100 of about 69%
for the subjective definition and of about 52% for the
objective definition. The joint audio-visual descriptor always
improves the MAP@100 and the P@100, while the re-ranking
improves only the P@100.</p>
      <p>In the future, we plan to extend our work on the joint
audio-visual descriptor and focus on optimizing it and on
testing it with more than just two features.</p>
    </sec>
    <sec id="sec-11">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work was partly realized as part of the Quaero
Program funded by OSEO, French State agency for innovation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>C.-H.</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Penet</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Schedl</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>, and
          <string-name><given-names>V. L. Q. Y.-G.</given-names> <surname>Jiang</surname></string-name>.
          <article-title>The MediaEval 2013 Affect Task: Violent Scenes Detection</article-title>.
          In <source>MediaEval 2013 Workshop</source>, Barcelona, Spain, October 18-19
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>N.</given-names> <surname>Derbas</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Thollard</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Safadi</surname></string-name>, and
          <string-name><given-names>G.</given-names> <surname>Quenot</surname></string-name>.
          <article-title>LIG at MediaEval 2012 Affect Task: use of a generic method</article-title>.
          In <source>MediaEval</source>, Pisa, Italy, October 4-5
          <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>B.</given-names> <surname>Safadi</surname></string-name> and
          <string-name><given-names>G.</given-names> <surname>Quenot</surname></string-name>.
          <article-title>Descriptor optimization for multimedia indexing and retrieval</article-title>.
          In <source>CBMI</source>, pages
          <fpage>65</fpage>–<lpage>71</lpage>, Veszprem, Hungary, June 17-19
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>B.</given-names> <surname>Safadi</surname></string-name> and
          <string-name><given-names>G.</given-names> <surname>Quenot</surname></string-name>.
          <article-title>Re-ranking for Multimedia Indexing and Retrieval</article-title>.
          In <source>ECIR 2011: 33rd European Conference on Information Retrieval</source>, pages
          <fpage>708</fpage>–<lpage>711</lpage>, Dublin, Ireland, April 18-21
          <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>F. D. M. d.</given-names> <surname>Souza</surname></string-name>,
          <string-name><given-names>G. C.</given-names> <surname>Chavez</surname></string-name>,
          <string-name><given-names>E. A. d.</given-names> <surname>Valle Jr.</surname></string-name>, and
          <string-name><given-names>A. d. A.</given-names> <surname>Araujo</surname></string-name>.
          <article-title>Violence detection in video using spatio-temporal features</article-title>.
          In <source>Proceedings of the 2010 23rd SIBGRAPI Conference on Graphics, Patterns and Images</source>, pages
          <fpage>224</fpage>–<lpage>230</lpage>, Washington, DC, USA, August 30-September 3
          <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>K. E. A.</given-names> <surname>van de Sande</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Gevers</surname></string-name>, and
          <string-name><given-names>C. G. M.</given-names> <surname>Snoek</surname></string-name>.
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>.
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>,
          <volume>32</volume>(<issue>9</issue>):<fpage>1582</fpage>–<lpage>1596</lpage>,
          <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-H.</given-names>
            <surname>Jhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Joint audio-visual bi-modal codewords for video event detection</article-title>
          .
          In
          <source>ACM International Conference on Multimedia Retrieval (ICMR)</source>
          , Hong Kong, June 5-8
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>