<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UMons at MediaEval 2015 Affective Impact of Movies Task including Violent Scenes Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Omar Seddati</string-name>
          <email>omar.seddati@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emre Kulah</string-name>
          <email>emre.kulah@ceng.metu.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gueorgui Pironkov</string-name>
          <email>gueorgui.pironkov@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Dupont</string-name>
          <email>stephane.dupont@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saïd Mahmoudi</string-name>
          <email>said.mahmoudi@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Dutoit</string-name>
          <email>thierry.dutoit@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Middle East Technical University</institution>
          ,
          <addr-line>Ankara</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mons</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present the work done at UMons regarding the MediaEval 2015 Affective Impact of Movies Task (including Violent Scenes Detection). This task can be divided into two subtasks: on the one hand, Violent Scene Detection, i.e. automatically finding scenes that are violent in a set of videos; on the other hand, evaluating the affective impact of the video through an estimation of its valence and arousal. In order to offer a solution for both the detection and classification subtasks, we investigate different visual and auditory feature extraction methods. An i-vector approach is applied for the audio, and optical flow maps processed through a deep convolutional neural network are tested for extracting features from the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>With the increasing amount of video content available, the
aim of the MediaEval 2015 "Affective Impact of Movies Task" is
to show users (depending on their age, preferences or mood)
the content they are looking for. More precisely, this year
the task focuses on two different aspects.</p>
      <p>The first subtask is Violent Scene Detection (VSD), the
goal being to alert parents about the potentially violent
content of a video. Thus, the criterion used for VSD
annotation is: "videos one would not let an 8 years old child
see because of their physical violence". Another possible
application could be facilitating video surveillance alerts, as
monitoring several screens simultaneously is a complicated
task, even for humans.</p>
      <p>
        In addition to VSD, and for the first time at this year's
MediaEval workshop, a second subtask is examined:
Induced Affect Detection. This subtask focuses on the impact
emotions can have for video or movie suggestions. Each
video scene is categorized depending on its valence class
(positive - neutral - negative) and its arousal class (active -
neutral - passive). The purpose here is to predict the feelings
that a particular video will elicit in a user, in order to
recommend similar or completely different content.
Both subtasks are examined on the same dataset: around
10,000 video clips from professional and amateur movies,
all under Creative Commons license. More
information about these subtasks can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>We use the same techniques for the VSD and affect
detection subtasks. In our approach, audio and video information
are analyzed separately; thus, two different feature
extraction methods are applied, one per modality.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Audio approach</title>
      <p>
        For the audio processing we use the same method as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
where i-vectors and Probabilistic Linear Discriminant
Analysis (pLDA) are used to classify environments (wedding
ceremony, birthday party, parade, etc.). The i-vector approach
consists of extracting a low-dimensional feature vector from
high-dimensional data without losing most of the relevant
acoustic information. This method was introduced by the
speaker recognition community and has also proven its
efficiency in language detection and in speaker adaptation for
speech recognition.
      </p>
      <p>
        In order to extract the i-vectors and classify them through
pLDA, we have used the Matlab MSR Identity Toolbox [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
For each audio track of the video shots, we extract 20
Mel-frequency cepstral coefficients, and the associated first and
second derivatives. Thus, we use as input 60-dimensional
features with a fixed length of 800 frames for each shot.
For each shot a 100-dimensional i-vector is extracted. All
the i-vectors are then processed through three independent
classifiers. The first one is trained to classify violent and
non-violent scenes. The second one differentiates positive,
neutral and negative valence. The third one is trained on
the three different levels of arousal.
      </p>
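      <p>As an illustration, the following minimal sketch reproduces the
acoustic front end described above (20 MFCCs plus first and second
derivatives, fixed to 800 frames per shot). It uses the Python library
librosa rather than the Matlab toolbox actually employed, so exact
coefficient values will differ; the file name and sampling rate are
assumptions.</p>
      <preformat>
import numpy as np
import librosa

def shot_features(wav_path, n_frames=800):
    """60-dim MFCC + delta + delta-delta features, fixed to n_frames."""
    y, sr = librosa.load(wav_path, sr=16000)             # assumed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # 20 x T
    d1 = librosa.feature.delta(mfcc)                     # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)            # second derivatives
    feats = np.vstack([mfcc, d1, d2])                    # 60 x T
    # Fix the length to n_frames per shot, as described in the paper:
    # zero-pad short shots, truncate long ones.
    pad = max(0, n_frames - feats.shape[1])
    feats = np.pad(feats, ((0, 0), (0, pad)))[:, :n_frames]
    return feats                                         # 60 x 800

feats = shot_features("shot_0001.wav")                   # hypothetical file
      </preformat>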
    </sec>
    <sec id="sec-4">
      <title>2.2 Video approach</title>
      <p>Convolutional neural networks (ConvNets) are a
state-of-the-art technique in the field of object recognition within
images. ConvNets applied to 2D images are adapted to capture
spatial configurations. Using them to capture temporal
information related to changes between video frames requires
using several frames as input. A drawback is that this
significantly increases the dimensionality of the input. Thus, an
alternative approach consists of using optical flow maps as
input. Each map represents the motion of each pixel
between two successive frames.</p>
      <p>
        We used the TV-L1 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] algorithm from the OpenCV toolbox
for optical flow extraction. We use 10 stacked optical flow
frames as input. Note that 10 stacked optical flows equal
20 maps, given that both horizontal and vertical components
have to be provided. In order to reduce overfitting we use
dropout, as well as data augmentation by randomly cropping
and flipping the maps of the input sequence. We also
estimate the motion of the camera by calculating the mean
across the maps of the same component (horizontal and
vertical), then we subtract the corresponding mean. Our
system is implemented using the publicly available Torch toolbox [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
which offers a powerful and varied set of tools, especially for
building and training ConvNets. The details of the
architecture used are listed in Table 1.
      </p>
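      <p>A minimal sketch of this input pipeline, in Python with the
opencv-contrib package, is given below: it stacks 10 TV-L1 flows into
a 20-map input and applies the camera-motion compensation by mean
subtraction. Subtracting a per-component scalar mean is one reading of
the description above, and the function name is ours.</p>
      <preformat>
import numpy as np
import cv2

# TV-L1 optical flow; lives in the opencv-contrib package in recent
# OpenCV releases (older releases expose cv2.DualTVL1OpticalFlow_create).
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def flow_stack(gray_frames):
    """Stack 10 consecutive TV-L1 flows into a 20-channel input."""
    maps = []
    for prev, nxt in zip(gray_frames[:10], gray_frames[1:11]):
        flow = tvl1.calc(prev, nxt, None)       # H x W x 2 (dx, dy)
        maps.extend([flow[..., 0], flow[..., 1]])
    stack = np.stack(maps)                      # 20 x H x W
    # Crude camera-motion compensation (one reading of the paper):
    # subtract, per component, the mean displacement over the stack.
    stack[0::2] -= stack[0::2].mean()           # horizontal maps
    stack[1::2] -= stack[1::2].mean()           # vertical maps
    return stack
      </preformat>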
      <p>Using dense optical flow maps means that the size of the
neural network increases rapidly with the length of the
sequence used as input. This implies that short sub-sequences
of video frames (or rather optical flow maps) have to be
used as input to the ConvNet. This increases the risk that
those sub-sequences fall on parts of the video where there is
no useful information for the identification of the category.
To tackle this problem, we use a sliding window approach
at test time, estimating the probability for each category
in several sub-sequences of the video. The class with the
highest probability after averaging over all the different
sub-sequence probabilities is selected as the most likely class.</p>
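      <p>Such a sliding-window prediction could look as follows;
<monospace>convnet_probs</monospace> is a hypothetical stand-in for the
trained ConvNet, and the window stride is an assumption.</p>
      <preformat>
import numpy as np

def classify_video(flow_maps, convnet_probs, win=10, stride=5):
    """Average class probabilities over all sub-sequences, pick the argmax.

    flow_maps: T x H x W x 2 array of optical flows for the whole shot.
    convnet_probs: hypothetical callable mapping a 20-channel window
    to a vector of class probabilities (the trained ConvNet).
    """
    probs = []
    for t in range(0, flow_maps.shape[0] - win + 1, stride):
        window = flow_maps[t:t + win]                     # win x H x W x 2
        channels = window.transpose(0, 3, 1, 2).reshape(
            2 * win, *window.shape[1:3])                  # 20 x H x W
        probs.append(convnet_probs(channels))
    # Class with the highest averaged probability is the prediction.
    return int(np.argmax(np.mean(probs, axis=0)))
      </preformat>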
      <p>
        We also train a ConvNet with the same architecture on
the HMDB-51 dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (action recognition benchmark),
in order to build a more robust motion feature extractor
leveraging this additional external data. Then, we extract
features from the MediaEval annotated data and train a
two-layer fully connected feed-forward neural network on
those features.
      </p>
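      <p>The paper's networks were built in Torch; the sketch below
re-expresses such a two-layer fully connected classifier with dropout
in present-day PyTorch purely for illustration. The feature dimension,
hidden width and dropout rate are assumptions, not values from the
paper.</p>
      <preformat>
import torch
import torch.nn as nn

# Hypothetical dimensions: the size of the HMDB-pretrained feature
# vector and the hidden width are assumptions.
FEAT_DIM, HIDDEN, N_CLASSES = 4096, 512, 2    # 2 classes for VSD

classifier = nn.Sequential(
    nn.Linear(FEAT_DIM, HIDDEN),
    nn.ReLU(),
    nn.Dropout(0.5),                          # dropout, as in the paper
    nn.Linear(HIDDEN, N_CLASSES),
)

def train_step(features, labels, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """One gradient step on a batch of pre-extracted ConvNet features."""
    optimizer.zero_grad()
    loss = loss_fn(classifier(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
      </preformat>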
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>We have submitted three runs for both subtasks. The
results for the VSD task are presented in Table 2. The Mean
Average Precision (MAP) is computed for each run. We can
see that using external data from HMDB to train
the feature extractor is less effective than training the feature
extractor on the MediaEval dataset. The i-vector &amp; pLDA
technique gives results similar to those of the optical flow maps &amp;
ConvNets combination.</p>
      <p>The global accuracy for the affect detection task is shown
in Table 3. For valence, all methods give similar results.
A difference appears for the arousal task: the audio features
perform poorly in comparison to the other runs. Using
external data proves more beneficial here, as the last run
significantly outperforms the second run. Motion seems to
be an important discriminative factor for arousal estimation.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Discussion</title>
      <p>We have also investigated merging the audio and visual
features together. The features from the ConvNets
extractor and the i-vectors were used as input to another neural
network, but the results were poorer than when using the features
separately. Further work will investigate audio-visual fusion
more in depth.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>In this paper we presented two approaches for both affect
and violent scene detection. Visual and audio features are
processed separately. Both feature types give similar
results for violence detection and valence. For arousal, video
features are far more informative, especially when the
ConvNets feature extractor is trained on external data. Our
future work will focus on merging the audio and video
features.</p>
    </sec>
    <sec id="sec-8">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This work has been partly funded by the Walloon Region
of Belgium through the Chist-Era IMOTION project
(Intelligent Multi-Modal Augmented Video Motion Retrieval
System) and by the European Regional Development Fund
(ERDF) through the DigiSTORM project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Farabet</surname>
          </string-name>
          .
          <article-title>Torch7: A matlab-like environment for machine learning</article-title>
          .
          <source>In BigLearn, NIPS Workshop, number EPFL-CONF-192376</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Elizalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          .
          <article-title>An i-vector representation of acoustic environments for audio-based video event detection on user generated content</article-title>
          .
          <source>In Multimedia (ISM)</source>
          ,
          <source>2013 IEEE International Symposium on</source>
          , pages
          <fpage>114</fpage>
          –
          <lpage>117</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kuehne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Garrote</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Serre</surname>
          </string-name>
          .
          <article-title>HMDB: a large video database for human motion recognition</article-title>
          .
          <source>In Computer Vision</source>
          (ICCV),
          <year>2011</year>
          IEEE International Conference on, pages
          <fpage>2556</fpage>
          –
          <lpage>2563</lpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Meinhardt-Llopis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Facciolo</surname>
          </string-name>
          .
          <article-title>TV-L1 optical flow estimation</article-title>
          .
          <source>Image Processing On Line</source>
          ,
          <year>2013</year>
          :
          <fpage>137</fpage>
          –
          <lpage>150</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Sadjadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Slaney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <article-title>MSR Identity Toolbox v1.0: A MATLAB toolbox for speaker recognition research</article-title>
          .
          <source>Speech and Language Processing Technical Committee Newsletter</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          .
          <article-title>The MediaEval 2015 affective impact of movies task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>