<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ionuț Mironică</string-name>
          <email>imironica@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mats Sjöberg</string-name>
          <email>mats.sjoberg@helsinki.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcin Skowron</string-name>
          <email>marcin.skowron@ofai.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Austrian Research Institute for Artificial Intelligence</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Helsinki Institute for Information Technology HIIT, University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Johannes Kepler University</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The MediaEval 2015 Affective Impact of Movies Task challenged participants to automatically find violent scenes in a set of videos and to predict the affective impact that video content will have on viewers. We propose the use of several multimodal descriptors, namely visual, motion and auditory features, and fuse their predictions to detect violent or affective content. With regard to the official metric, our best-performing run obtained a MAP of 0.1419 in the violence detection task, and accuracies of 45.038% for arousal estimation and 36.123% for valence estimation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The MediaEval 2015 Affective Impact of Movies Task [6]
challenged participants to develop algorithms for finding
violent scenes in movies. Also, in contrast to previous years,
the organizers introduced a completely new subtask for
detecting the emotional impact of movies. The task provided
a dataset of 10,900 short video clips extracted from 199
Creative Commons-licensed movies. A detailed description of the
task, the dataset, the ground truth and the evaluation criteria
is given in the paper by Sjöberg et al. [6].</p>
      <p>
        Our system this year is largely based on several
multimodal systems that already obtained good results on similar
problems [
        <xref ref-type="bibr" rid="ref3">3, 4, 5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
      <p>
        Our system builds on a set of visual, motion and auditory
features, combined with a Support Vector Machine (SVM)
classifier, to obtain a violence or an affect score for each video
document. First, we perform feature extraction at the
frame level. The resulting features are aggregated into one
video descriptor using different strategies: the average of the
features, the Fisher kernel (FK) [4] or the Vector of Locally
Aggregated Descriptors (VLAD) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Finally, the global video descriptors
are fed into an SVM multi-classifier framework. These steps
are detailed in the following.
      </p>
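      <p>As an illustration only, the following Python sketch shows the simplest instance of this pipeline (average aggregation of frame features followed by an RBF SVM); the function and variable names are ours and are not part of the submitted system.</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC

def video_descriptor(frame_features):
    """Aggregate per-frame features (n_frames x dim) into one video descriptor."""
    return np.mean(frame_features, axis=0)

def classify(train_videos, train_labels, test_videos):
    # train_videos / test_videos: lists of (n_frames x dim) feature arrays
    X_train = np.vstack([video_descriptor(v) for v in train_videos])
    X_test = np.vstack([video_descriptor(v) for v in test_videos])
    clf = SVC(kernel="rbf", probability=True).fit(X_train, train_labels)
    return clf.predict_proba(X_test)  # per-class confidence values
      </preformat>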
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature set</title>
      <p>
        Visual: We extracted ColorSIFT features [8] using the
opponent colour space and spatial pyramids with two different
sampling strategies: the Harris-Laplace salient point
detector and dense sampling. We employed the
Bag-of-Visual-Words (BoVW) approach, where each spatial pyramid
partition is represented by a 1,000-dimensional histogram over its
ColorSIFT features. We also computed the CENsus
TRansform hISTogram (CENTRIST) descriptor proposed in [9].
In addition, we used a total of four Convolutional Neural
Network (CNN) features, using the protocol laid out in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The CNNs were trained on either the ImageNet 2010 or
2012 training dataset, following as closely as possible the
network structure parameters of Krizhevsky et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Furthermore, the input images were resized to 256×256 pixels
either by distortion or center cropping, thus giving in total
four different CNNs from which we extract four different sets
of feature vectors. We use the activations of the first
fully-connected layer of each network as our features, which
results in 4096-dimensional feature vectors. Ten regions were
extracted from the test images as suggested in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (four
corners, center patch plus flipping) and then a component-wise
maximum is taken of the region-wise features.
      </p>
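      <p>The component-wise max pooling over the ten regions can be summarised by the following Python sketch; the array shapes are assumed from the description above and the function name is ours.</p>
      <preformat>
import numpy as np

def pooled_cnn_feature(region_activations):
    """Component-wise maximum over the ten region-wise CNN activations.

    region_activations: array of shape (10, 4096) holding the first
    fully-connected-layer activations of the four corner crops, the center
    crop, and their horizontally flipped versions for one image."""
    return region_activations.max(axis=0)  # single 4096-dimensional descriptor
      </preformat>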
      <p>Auditory: As for audio features, we used descriptors
provided within the block-level framework [5]. These have
proven to be useful for retrieval, classification, and
similarity tasks in the audio and music domain. More
precisely, for the audio channel of each video we computed its
spectral pattern (considers the cent-scaled spectrum on a
10-frame basis to characterize frequency and timbre), delta
spectral pattern (computes the difference between the
original spectrum and a copy of the spectrum delayed by 3
frames), variance delta spectral pattern (considers the
variance between the delta spectral blocks), logarithmic
fluctuation pattern (applies several psychoacoustic models and
characterizes the amplitude modulations), correlation
pattern (computes Pearson's correlation between all pairs of 52
cent-scaled frequency bands), and spectral contrast pattern
(computes the difference between spectral peaks and valleys
in 20 cent-scaled frequency bands). We eventually end up
with each clip being characterized by a 9,448-dimensional
feature vector that models its audio content.</p>
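      <p>To make one of these descriptors concrete, the sketch below computes a simplified correlation pattern (Pearson correlations between the cent-scaled frequency bands) in NumPy; the block segmentation and the remaining patterns of the block-level framework [5] are not reproduced here, and the function name is ours.</p>
      <preformat>
import numpy as np

def correlation_pattern(cent_spectrogram):
    """Pearson correlation between all pairs of cent-scaled frequency bands.

    cent_spectrogram: array of shape (52, n_frames), one row per band.
    Returns the upper triangle of the 52 x 52 correlation matrix."""
    corr = np.corrcoef(cent_spectrogram)        # (52, 52) Pearson correlations
    rows, cols = np.triu_indices_from(corr, k=1)
    return corr[rows, cols]                     # one value per band pair
      </preformat>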
      <p>Motion: We computed the Histograms of Oriented
Gradients (3D-HoG) and Histograms of Optical Flow (3D-HoF)
cuboid motion features [7]. We computed each
feature in 3D blocks with a dense sampling strategy: first,
the gradient magnitude responses in the horizontal and vertical
directions are computed. Then, for each response the
magnitude was quantized into k orientations, where k = 8. Finally,
these responses were aggregated over blocks of pixels in both
the spatial and temporal directions and concatenated.</p>
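      <p>The orientation quantization step can be illustrated, for a single grayscale frame, by the sketch below; the dense 3D block sampling, the optical-flow variant and the temporal aggregation are omitted, and the helper name is ours.</p>
      <preformat>
import numpy as np

def orientation_histogram(frame, k=8):
    """Quantize gradient orientations of one grayscale frame into k bins,
    weighting each pixel by its gradient magnitude; block-wise spatial and
    temporal aggregation then sums and concatenates such histograms."""
    gy, gx = np.gradient(frame.astype(float))      # vertical / horizontal responses
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bins = (orientation / (2 * np.pi / k)).astype(int) % k
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=k)
      </preformat>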
    </sec>
    <sec id="sec-4">
      <title>2.2 Frame aggregation</title>
      <p>
        Results from the literature have shown that adopting Fisher
kernel [4] and VLAD [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] representations in many video
classification tasks allows for achieving higher accuracy than
traditional Bag-of-Words histogram representations.
This is because these representations capture the temporal
variation over the frames within a video. We used two classical
methods to encode the temporal variation over frame-based
features: the Fisher kernel [4] and a modified version of the
Vector of Locally Aggregated Descriptors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Then, we
aggregated the frame features presented in Section 2.1.
      </p>
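      <p>As a reference point, the sketch below implements the standard VLAD encoding of per-frame features against a k-means codebook; the modified variant of [3] differs in details not repeated here, and the function names and parameter values are ours.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(training_frames, k=64):
    """Fit a k-means codebook on a subsample of training frame features."""
    return KMeans(n_clusters=k, random_state=0).fit(training_frames)

def vlad_encode(frame_features, codebook):
    """Standard VLAD: accumulate residuals of frame features to their
    nearest codeword, then power- and L2-normalize."""
    centers = codebook.cluster_centers_
    assignments = codebook.predict(frame_features)
    vlad = np.zeros_like(centers)
    for c in range(len(centers)):
        members = frame_features[assignments == c]
        if len(members):
            vlad[c] = (members - centers[c]).sum(axis=0)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power normalization
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)   # L2 normalization
      </preformat>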
    </sec>
    <sec id="sec-5">
      <title>2.3 Classifier</title>
      <p>The final component of the system is the
classifier, which is fed with the multimodal descriptors
computed in the previous steps. Among the broad choice of existing
classification approaches, we selected an SVM classifier. We
tested several types of kernels, i.e., a fast linear kernel and
two nonlinear kernels: RBF and Chi-Square. While linear
SVMs are very fast in both training and testing, SVMs with
nonlinear kernels are more accurate in many classification
tasks due to better adaptation to the shape of the clusters
in the feature space.</p>
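      <p>A minimal scikit-learn sketch of the Chi-Square variant, using a precomputed kernel matrix, is given below; the parameter values are illustrative, not the ones used in our runs.</p>
      <preformat>
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def chi2_svm_confidences(X_train, y_train, X_test, gamma=1.0, C=1.0):
    """SVM with an exponential Chi-Square kernel via a precomputed Gram matrix.

    X_train / X_test are non-negative histogram-like descriptors (e.g. BoVW)."""
    K_train = chi2_kernel(X_train, X_train, gamma=gamma)
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)
    clf = SVC(kernel="precomputed", C=C, probability=True).fit(K_train, y_train)
    return clf.predict_proba(K_test)   # per-class confidence values
      </preformat>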
      <p>Finally, in the case of multimodal features, we combine
the SVMs' output confidence values using a max late-fusion
combination:</p>
      <p>CombMean(d, q) = max_{i=1,...,N} cv_i    (1)
where cv_i is the confidence value of classifier i for class q
(q ∈ {1, ..., C}), C represents the number of classes, d is
the current video, and N is the number of classifiers to be
aggregated.</p>
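      <p>In code, Equation (1) amounts to a component-wise maximum over the classifiers' confidence matrices; the sketch below assumes the per-classifier confidences have already been computed.</p>
      <preformat>
import numpy as np

def max_late_fusion(confidences):
    """confidences: list of N arrays of shape (n_videos, n_classes), one per
    classifier; the fused score is the component-wise maximum (Equation 1)."""
    return np.maximum.reduce(confidences)

# e.g. fused = max_late_fusion([p_audio, p_visual, p_motion, p_fk_cnn])
#      predicted_class = fused.argmax(axis=1)
      </preformat>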
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Submitted runs</title>
      <p>We submitted five runs for both tasks: the violence
detection task and the induced affect detection task. For the
first run we combined the audio features with a nonlinear
SVM classifier. For the second run, we combined several
visual features (BoVW-ColorSIFT, CENTRIST histograms
and CNN features) with a nonlinear SVM classifier. The third
run uses a combination of the modified VLAD with the
3D-HoG/3D-HoF motion features and nonlinear SVM
classifiers. In the fourth run, we aggregated the
CNN frame features with the Fisher kernel representation
and used a linear SVM classifier. Finally, for the fifth
run we performed a late fusion of the first four
runs.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Results and discussion</title>
      <p>Table 1 details the results for all our runs. The third
column presents the MAP results obtained on the violence
task, while the next two columns provide the final accuracy
on the second task: the valence and arousal predictions.</p>
      <p>Audio features and standard visual features performed
poorly in the violence task. On the other hand, the
combination of VLAD with motion features obtained better
results. The best results were obtained using the Fisher kernel with
CNN visual features. Fusing all the features together did not
improve the results over the FK-CNN-only result. In
contrast, in the induced affect detection task all combinations
perform similarly, except for the audio features, which obtain a
clearly better result.</p>
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSIONS</title>
      <p>In this paper, we presented several multimodal approaches
for the detection of violent content in movies. We obtained
the best results on the violence task by using motion and
visual features. On the other hand, we obtained the best results
on the affect task using the audio features only. The visual and
motion features obtained lower results for both valence and
arousal prediction. One reason for this is that the visual
features do not fit the purpose of the affect task. It also
indicates that the affect task is more challenging than the
violence task.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgements</title>
      <p>We received support from the Austrian Science Fund (FWF):
P25655 and the InnoRESEARCH POSDRU/159/1.5/S/132395 program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koskela</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Laaksonen</surname>
          </string-name>
          .
          <article-title>Convolutional network features for scene recognition</article-title>
          .
          <source>In Proceedings of the 22nd International Conference on Multimedia, Orlando</source>
          , Florida,
          <year>November 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Conference on Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Duta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebe</surname>
          </string-name>
          . <article-title>A Modified Vector of Locally Aggregated Descriptors Approach for Fast Video Classification</article-title>. <source>Multimedia Tools and Applications (MTAP)</source>, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] I. Mironica, J. Uijlings, N. Rostamzadeh, B. Ionescu, and N. Sebe. <article-title>Time Matters! Capturing Variation in Time in Video using Fisher Kernels</article-title>. <source>ACM Multimedia</source>, Barcelona, Spain, 21-25 October <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] K. Seyerlehner, G. Widmer, M. Schedl, and P. Knees. <article-title>Automatic Music Tag Classification based on Block-Level Features</article-title>. <source>Proceedings of the 7th Sound and Music Computing Conference (SMC 2010)</source>, Barcelona, Spain, July <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandrea, M. Schedl, C.-H. Demarty, and L. Chen. <article-title>The MediaEval 2015 Affective Impact of Movies Task</article-title>. <source>MediaEval 2015 Workshop</source>, Wurzen, Germany, September 14-15 <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Uijlings, I. Duta, E. Sangineto, and N. Sebe. <article-title>Video classification with densely extracted HOG/HOF/MBH features: an evaluation of the accuracy/computational efficiency trade-off</article-title>. <source>International Journal of Multimedia Information Retrieval</source>, pages 1-12, <year>2014</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. <article-title>Evaluating color descriptors for object and scene recognition</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)</source>, 32(9):1582-1596, <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Wu and J. M. Rehg. <article-title>CENTRIST: A visual descriptor for scene categorization</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)</source>, 33(8):1489-1501, <year>2011</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>