<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KIT at MediaEval 2015 - Evaluating Visual Cues for Affective Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marin Vlastelica P.</string-name>
          <email>marin.vlastelicap@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergey Hayrapetyan</string-name>
          <email>s.hayrapetyan@hotmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Makarand Tapaswi</string-name>
          <email>tapaswi@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rainer Stiefelhagen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision for Human Computer Interaction, Karlsruhe Institute of Technology</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>We present the approach and results of our system on the MediaEval Affective Impact of Movies Task. The challenge involves two primary tasks: affect classification and violence detection. We test the performance of multiple visual features followed by linear SVM classifiers. Inspired by successes in different vision fields, we use (i) GIST features used in scene modeling, (ii) features extracted from a deep convolutional neural network trained on object recognition, and (iii) improved dense trajectory features encoded using Fisher vectors, commonly used in action recognition.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        As the number of videos grows rapidly, automatically analyzing and indexing them is a topic of growing interest. One interesting area is to analyze the affect such videos have on viewers. This can lead to improved recommendation systems (in the case of movies) or help improve overall video search performance. Another task is to predict the amount of violent content in videos, thus supporting automatic filters for sensitive videos based on viewer age. The MediaEval 2015 task "Affective Impact of Movies" [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] studies these two areas.
      </p>
      <p>
        The affect task is posed as a classification problem on a two-dimensional arousal-valence plane, where each dimension is discretized into 3 values (classes). The violence task, on the other hand, is posed as a detection problem. Please refer to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for task and dataset details.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>In this section we describe the features and classifiers we use to analyze the affective impact of movies.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Development splits</title>
      <p>The development set consists of 6144 short video clips obtained from 100 different movies. To analyze the movies we use 5-fold cross-validation on the dataset. The data is split into 5 sets with two goals in mind: (i) the source movies in the training and test splits are different; (ii) the distribution of class labels (positive/neutral/negative) is kept close to that of the original complete set.
In this way, we achieve 5 fairly independent splits for training and testing our models. The splits contain differing numbers of movies in the training and test sets, ranging from 65/35 to 91/9.</p>
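      <p>As a minimal sketch (not the authors' exact code), splits satisfying the movie-disjointness goal can be generated with scikit-learn's GroupKFold; the clip labels and source-movie ids below are random stand-ins for the dataset metadata, and the label-distribution goal (ii) still has to be checked per fold.</p>
      <preformat>
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_clips = 6144
clip_labels = rng.integers(-1, 2, size=n_clips)   # stand-in valence classes {-1, 0, +1}
clip_movies = rng.integers(0, 100, size=n_clips)  # stand-in source-movie id per clip

# GroupKFold guarantees that clips of the same movie never appear in both
# the training and the test part of a fold (goal (i) above).
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(clip_labels, groups=clip_movies)):
    # Goal (ii): the per-fold label distribution should stay close to the full set.
    train_dist = np.bincount(clip_labels[train_idx] + 1) / len(train_idx)
    test_dist = np.bincount(clip_labels[test_idx] + 1) / len(test_idx)
    print(fold, train_dist.round(3), test_dist.round(3))
      </preformat>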
    </sec>
    <sec id="sec-4">
      <title>2.2 Descriptors and models</title>
      <p>We focus primarily on simple visual cues to estimate the affect of videos and detect violence in them. To this end, we use three feature types together with linear SVM classifiers.</p>
      <p>For the image-based descriptors, we extract exemplar images from the video, sampled every 10 frames. To compensate for shot changes within the video clips, we do not average the features across the video but use them directly to train our models. The video-level label is assumed to be shared across all images of the clip.</p>
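      <p>A short sketch of this frame sampling, assuming OpenCV for video decoding (the paper does not name the decoding library):</p>
      <preformat>
import cv2

def sample_frames(video_path, step=10):
    """Return every `step`-th frame of a video clip as a list of BGR images."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
      </preformat>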
      <p>
        GIST. We use GIST features that were developed in the context of scene recognition [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We expect these features to provide good performance on the valence task. The features are extracted on each cell of the image, broken down using a 4×4 grid, to yield a 512-dimensional descriptor. We then train multi-class linear SVM classifiers on these features for the affect tasks (valence and arousal) and another linear SVM for the violence detection task.
      </p>
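      <p>For illustration, the multi-class linear SVMs can be trained with scikit-learn's LinearSVC (a liblinear wrapper, consistent with Section 2.3); the GIST matrices below are random stand-ins for descriptors from an external GIST extractor, which the paper does not specify.</p>
      <preformat>
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
gist_train = rng.standard_normal((1000, 512))  # stand-in 4x4-grid GIST descriptors
y_train = rng.integers(0, 3, size=1000)        # stand-in per-frame valence classes

# One-vs-rest multi-class linear SVM on the frame-level descriptors.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(gist_train, y_train)
frame_scores = clf.decision_function(rng.standard_normal((10, 512)))
      </preformat>
      <p>A video-level decision can then be derived by pooling the per-frame outputs; the exact aggregation is not detailed here.</p>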
      <p>
        CNN features. Since the ImageNet-winning method proposed by Krizhevsky et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in 2012, deep convolutional neural networks (CNNs) have revolutionized computer vision. These networks have a large number of parameters and are trained end-to-end (from image to label) using massive datasets. The initial convolutional layers act as low-level feature extractors, while the higher fully connected layers learn about object shapes.
      </p>
      <p>
        Inspired by DeCAF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we use the BVLC Reference CaffeNet model provided with the Caffe framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as a feature extractor. The model contains 5 convolutional layers, 2 fully connected layers and a soft-max classifier. We use the output of the last fully connected layer to obtain 4096-dimensional features for the images from the video clips. Linear SVMs are trained on these features for all tasks. Owing to the complexity of the model and its ability to capture a large number of variations, we expect these features to perform well on all tasks.
      </p>
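      <p>A hedged sketch of the fc7 feature extraction with pycaffe; the deploy, weight, and mean files are the ones distributed with Caffe's model zoo and example scripts (paths shortened here), and the input image is a placeholder.</p>
      <preformat>
import numpy as np
import caffe

caffe.set_mode_cpu()
# Model definition and weights shipped with the Caffe distribution / model zoo.
net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel', caffe.TEST)

# Standard CaffeNet preprocessing: HWC-to-CHW, RGB-to-BGR, 0-255 range, mean subtraction.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255)
transformer.set_mean('data', np.load('ilsvrc_2012_mean.npy').mean(1).mean(2))

image = caffe.io.load_image('frame.jpg')       # placeholder exemplar image
net.blobs['data'].reshape(1, 3, 227, 227)
net.blobs['data'].data[...] = transformer.preprocess('data', image)
net.forward()
fc7 = net.blobs['fc7'].data[0].copy()          # 4096-dimensional descriptor
      </preformat>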
      <p>
        Improved Dense Trajectories. Dense trajectories are an effective descriptor for action recognition. Wang and Schmid [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] recently proposed additional steps to obtain Improved Dense Trajectories (IDT). Unlike dense trajectories, these features estimate and correct for camera motion, and thus obtain trajectories primarily on the foreground moving objects (often human actors). As violence in videos is often characterized by rapid motion, we anticipate these features to work well for violence detection.
      </p>
      <p>
        Several descriptors are computed for each trajectory: Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), Motion Boundary Histogram (MBH) and overall trajectory characteristics, yielding a 426-dimensional representation per trajectory. These features are projected via PCA to 213 dimensions and finally encoded using state-of-the-art Fisher vector encoding [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This results in a 109,056-dimensional feature representation for the entire video. Finally, as before, we train linear SVMs using these features.
      </p>
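      <p>The Fisher vector encoding itself was implemented in Matlab (Section 2.3); the following is a simplified Python sketch of the same encoding, using gradients with respect to the GMM means and variances and the power and L2 normalization of [8]. With K = 256 Gaussians and D = 213 PCA dimensions this yields the 2KD = 109,056-dimensional video representation mentioned above; the descriptors and the per-video GMM fit below are stand-ins, since in practice the GMM is trained on descriptors pooled from the training videos.</p>
      <preformat>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Encode local descriptors X (N x D) as a 2*K*D Fisher vector."""
    n = X.shape[0]
    q = gmm.predict_proba(X)                     # (N, K) soft assignments
    g_mu, g_sigma = [], []
    for k in range(gmm.n_components):
        sig_k = np.sqrt(gmm.covariances_[k])     # diagonal covariance model
        w_k = gmm.weights_[k]
        diff = (X - gmm.means_[k]) / sig_k       # (N, D)
        g_mu.append((q[:, k:k + 1] * diff).sum(0) / (n * np.sqrt(w_k)))
        g_sigma.append((q[:, k:k + 1] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w_k)))
    fv = np.concatenate(g_mu + g_sigma)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))       # power normalization, as in [8]
    return fv / (np.linalg.norm(fv) + 1e-12)     # L2 normalization, as in [8]

rng = np.random.default_rng(0)
raw = rng.standard_normal((2000, 426))           # stand-in IDT descriptors of one video
X = PCA(n_components=213).fit_transform(raw)
gmm = GaussianMixture(n_components=256, covariance_type="diag",
                      max_iter=20, random_state=0).fit(X)
video_fv = fisher_vector(X, gmm)                 # 2 * 256 * 213 = 109,056 dimensions
      </preformat>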
    </sec>
    <sec id="sec-5">
      <title>2.3 Software environment</title>
      <p>
        The descriptors and models were developed and trained in Python and Matlab. We used the scikit-learn machine learning framework [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in Python, which uses the liblinear SVM library [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as the backend. For extracting features from deep neural networks we used the Caffe framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which provides a simple interface for classification and feature extraction with convolutional neural networks (CNNs). We extract IDT features using the provided code and implement Fisher vector encoding in Matlab.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. EVALUATION</title>
      <p>We now present and discuss the results obtained by the visual features. The best classifier parameters were obtained via cross-validation on the development set splits.</p>
      <p>The affect task, which includes valence and arousal, is treated as a multi-class classification problem (three classes each). The metric for these is the overall class prediction accuracy (acc). The violence task is a detection problem and uses average precision (ap) to evaluate the different methods.</p>
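      <p>For reference, both metrics are available in scikit-learn; the label and score arrays below are random stand-ins for the actual predictions.</p>
      <preformat>
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)
y_true_valence = rng.integers(0, 3, 100)   # stand-in 3-class valence ground truth
y_pred_valence = rng.integers(0, 3, 100)   # stand-in class predictions
y_true_violence = rng.integers(0, 2, 100)  # stand-in binary violence labels
violence_scores = rng.random(100)          # stand-in detector confidence scores

acc = accuracy_score(y_true_valence, y_pred_valence)            # affect metric (acc)
ap = average_precision_score(y_true_violence, violence_scores)  # violence metric (ap)
      </preformat>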
      <p>We present the results of various run submissions in
Table 1. The run submissions are as follows:</p>
      <p>Run 1: GIST features + linear SVMs
Run 2: IDT features + linear SVMs
Run 3: CNN features + linear SVMs
Run 4: Fusion-1 + linear SVMs</p>
      <p>Run 5: Fusion-2 + linear SVMs</p>
      <p>Note that Runs 3 and 5 constitute external runs (Ext) since they use pre-trained CNN models. All other submissions are trained solely on the development data.</p>
      <p>We see that CNN features (Run 3) outperform the first two single-feature runs on valence and violence. Contrary to expectations, IDT features (Run 2) perform best on arousal classification. This can be explained by the fact that passive videos often have very little motion, while active ones have much more.</p>
      <p>While we expected IDT features to perform well on violence, videos annotated as violent need not contain active motion and can often be shots of a post-crime scene. CNN features seem to work better in this case.</p>
      <p>Fusion runs. Runs 4 and 5 are fusions of different features. Run 4, the Fusion-1 scheme, uses the features provided along with the dataset (IAV: image/audio/video features concatenated and trained as one model), GIST and IDT. Run 5, the Fusion-2 scheme, includes the above along with CNN features (thus making it an external data run).</p>
      <p>In order to fuse the different features, we choose the best model for each feature type. We then perform late fusion, where the final score for each video is a weighted combination of the individual feature predictions. We try a grid of discrete weights to generate a large number of combinations and pick the best-scoring model based on cross-validation.</p>
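      <p>A minimal sketch of this late-fusion weight search, shown for the violence task; the per-feature score arrays are random stand-ins for the cross-validation predictions of the best single-feature models.</p>
      <preformat>
import itertools
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_clips = 500
y_true = rng.integers(0, 2, n_clips)            # stand-in violence labels
scores = {"IAV": rng.random(n_clips),           # stand-in per-feature scores
          "GIST": rng.random(n_clips),
          "IDT": rng.random(n_clips)}

grid = np.arange(0.0, 1.01, 0.1)                # discrete weights per feature
candidates = []
for w in itertools.product(grid, repeat=len(scores)):
    if sum(w) == 0:
        continue                                # skip the all-zero combination
    fused = sum(wi * s for wi, s in zip(w, scores.values()))
    candidates.append((average_precision_score(y_true, fused), w))

best_ap, best_w = max(candidates)               # best weights on the dev folds
      </preformat>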
      <p>Both fusion schemes perform equal to or better than the single features. For the Fusion-1 scheme, we see that IDT features get the highest weight, followed by the IAV (dataset) features. In the case of Fusion-2, CNN and IDT features are weighted higher.</p>
      <p>Error analysis. We present a short analysis of the errors we encountered on the development set. For violent video detection, some of the difficult samples include black-and-white videos with rapid blinking. In the case of affect analysis, for both valence and arousal classification, cartoon scenes were often deemed colorful and classified as positive (or active) while their ground truth was neutral or negative (or passive).</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>
        We conclude that CNN features are the best single features for studying the affective impact of movies. Fine-tuning the model, or training a model to perform video classification as in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], could further improve the performance. Fusing the models results in only a slight improvement, indicating that using other modalities such as meta-data and audio might help improve performance.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Python</surname>
          </string-name>
          scikit
          <article-title>-learn: machine learning framework</article-title>
          . http://scikit-learn.org/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            J. Ho man,
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , E. Tzeng, and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell. DeCAF: A Deep Convolutional</surname>
          </string-name>
          <article-title>Activation Feature for Generic Visual Recognition</article-title>
          .
          <source>In International Conference on Machine Learning (ICML)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.-E.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-J. Hsieh</surname>
            ,
            <given-names>X.-R.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            , and
            <given-names>C.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBLINEAR: A Library for Large Linear Classi cation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          :
          <year>1871</year>
          {
          <year>1874</year>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and T. Darrell. Ca e:
          <article-title>Convolutional Architecture for Fast Feature Embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          , G. Toderici,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sukthankar</surname>
          </string-name>
          , and L.
          <string-name>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <article-title>Large-scale video classi cation with convolutional neural networks</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet Classi cation with Deep Convolutional Neural Networks</article-title>
          .
          <source>In Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          .
          <article-title>Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope</article-title>
          .
          <source>International Journal of Computer Vision (IJCV)</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <volume>145</volume>
          {
          <fpage>175</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mensink</surname>
          </string-name>
          .
          <article-title>Improving the Fisher kernel for large-scale image classi cation</article-title>
          .
          <source>In European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjo</surname>
          </string-name>
          berg, Y. Baveye,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , E. Dellandrea,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Demarty</surname>
            , and
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The MediaEval 2015 A ective Impact of Movies Task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Action Recognition with Improved Trajectories</article-title>
          .
          <source>In IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>