         KIT at MediaEval 2015 – Evaluating Visual Cues for
                  Affective Impact of Movies Task
        Marin Vlastelica P.*, Sergey Hayrapetyan*, Makarand Tapaswi*, Rainer Stiefelhagen
               Computer Vision for Human Computer Interaction, Karlsruhe Institute of Technology, Germany
                      marin.vlastelicap@gmail.com, s.hayrapetyan@hotmail.com, tapaswi@kit.edu




* Indicates equal contribution.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany


ABSTRACT
We present the approach and results of our system on the MediaEval Affective Impact of Movies Task. The challenge involves two primary tasks: affect classification and violence detection. We test the performance of multiple visual features followed by linear SVM classifiers. Inspired by successes in different vision fields, we use (i) GIST features used in scene modeling, (ii) features extracted from a deep convolutional neural network trained on object recognition, and (iii) improved dense trajectory features encoded using Fisher vectors, as commonly used in action recognition.

1. INTRODUCTION
As the number of videos grows rapidly, automatically analyzing and indexing them is a topic of growing interest. One interesting area is to analyze the affect such videos have on viewers. This can lead to improved recommendation systems (in the case of movies) or help improve overall video search performance. Another task is to predict the amount of violent content in videos, thus supporting automatic filters for sensitive videos based on viewer age. The MediaEval 2015 task "Affective Impact of Movies" [9] studies these two areas.

The affect task is posed as a classification problem on a two-dimensional arousal-valence plane, where each dimension is discretized into 3 values (classes). The violence task, on the other hand, is posed as a detection problem. Please refer to [9] for task and dataset details.

2. APPROACH
In this section we describe the features and classifiers we use to analyze the affective impact of movies.

2.1 Development splits
The development set consists of 6144 short video clips obtained from 100 different movies. To analyze the movies we use 5-fold cross-validation on the dataset. The data is split into 5 sets with two goals in mind: (i) the source movies in the training and test splits are different; (ii) the distribution of class labels (positive/neutral/negative) is kept close to that of the original complete set. In this way, we achieve 5 fairly independent splits for training and testing our models. The splits contain differing numbers of movies in the training and test sets, ranging from 65/35 to 91/9.
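The paper does not describe the splitting procedure beyond these two constraints. As a rough sketch, movie-disjoint folds can be generated with scikit-learn's GroupKFold by using the source movie as the group key; the names below are illustrative, and the label balance required by constraint (ii) is only reported here, not enforced.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def make_dev_splits(clip_labels, clip_movies, n_splits=5):
    """Movie-disjoint folds: clips of the same movie never appear in both the
    training and the test part of a fold. The class balance of each fold is
    reported but not enforced (the paper additionally keeps it close to the
    distribution of the full development set)."""
    clip_labels = np.asarray(clip_labels)
    dummy = np.zeros((len(clip_labels), 1))   # features are irrelevant for splitting
    splits = []
    for train_idx, test_idx in GroupKFold(n_splits).split(dummy, clip_labels,
                                                          groups=clip_movies):
        labels, counts = np.unique(clip_labels[train_idx], return_counts=True)
        print("train label distribution:", dict(zip(labels, counts / counts.sum())))
        splits.append((train_idx, test_idx))
    return splits
```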
2.2 Descriptors and models
We focus primarily on simple visual cues to estimate the affect of videos and to detect violence in them. To this end, we use three feature types and linear SVM classifiers.

For the image-based descriptors, we extract exemplar images from the video, sampled every 10 frames. To compensate for shot changes within the video clips, we do not average the features across the video but use them directly to train our models. The video-level label is assumed to be shared across all images of the clip.
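The paper does not name its video decoding tool; assuming OpenCV for illustration, the exemplar images could be sampled as follows, with the step of 10 frames stated above. Each sampled frame then inherits the label of its clip.

```python
import cv2  # OpenCV is an assumption; the paper does not state how frames are decoded

def sample_frames(video_path, step=10):
    """Return every `step`-th frame of a clip as an RGB image (exemplar images)."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames
```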
GIST. We use GIST features that were developed in the context of scene recognition [7]. We expect these features to provide good performance on the valence task. The features are extracted on each cell of an image broken down using a 4 × 4 grid, yielding a 512-dimensional descriptor. We then train multi-class linear SVM classifiers on these features for the affect tasks (valence and arousal) and another linear SVM for the violence detection task.
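As an illustration of this per-frame training setup, the sketch below trains a 3-class linear SVM (one-vs-rest, liblinear backend [3]) on stand-in GIST descriptors. The random data and the value of C are placeholders, since the paper selects classifier parameters by cross-validation, and the aggregation of per-frame scores into a clip-level decision is not detailed in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for real data: each row is the 512-d GIST descriptor of one sampled
# frame (4 x 4 grid cells x 32 oriented-filter energies), and each frame carries
# the valence class of its clip (negative / neutral / positive).
rng = np.random.RandomState(0)
X_train = rng.rand(1000, 512)
y_train = rng.randint(0, 3, size=1000)

clf = LinearSVC(C=1.0)          # C=1.0 is a placeholder value
clf.fit(X_train, y_train)
frame_scores = clf.decision_function(X_train[:5])   # (5, 3) per-class scores
# Per-frame scores can then be aggregated (e.g. averaged) into a clip prediction.
```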
CNN features. Since the ImageNet-winning method proposed by Krizhevsky et al. [6] in 2012, deep convolutional neural networks (CNNs) have revolutionized computer vision. These networks have a large number of parameters and are trained end-to-end (from image to label) using massive datasets. The initial convolutional layers act as low-level feature extractors, while the higher-level fully connected layers learn about object shapes.

Inspired by DeCAF [2], we use the BVLC Reference CaffeNet model provided with the Caffe framework [4] as a feature extractor. The model contains 5 convolutional layers, 2 fully connected layers and a soft-max classifier. We use the output of the last fully connected layer to obtain 4096-dimensional features for the images from the video clips. Linear SVMs are trained on these features for all tasks. Owing to the complexity of the model and its ability to capture a large number of variations, we expect these features to perform well on all tasks.
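A minimal sketch of this feature extraction with the Caffe Python interface is given below. The file names are placeholders, mean subtraction is omitted for brevity, and taking the fc7 blob as the last fully connected layer before the soft-max classifier is an assumption consistent with the 4096-dimensional output described above.

```python
import caffe

# Placeholder paths for the BVLC Reference CaffeNet files shipped with Caffe [4].
net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel', caffe.TEST)

# Standard CaffeNet preprocessing (mean subtraction omitted for brevity).
tf = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
tf.set_transpose('data', (2, 0, 1))     # H x W x C  ->  C x H x W
tf.set_raw_scale('data', 255.0)         # [0, 1]  ->  [0, 255]
tf.set_channel_swap('data', (2, 1, 0))  # RGB  ->  BGR

def fc7_feature(image_path):
    """Return the 4096-d activation of the fc7 layer for one exemplar image."""
    img = caffe.io.load_image(image_path)
    net.blobs['data'].reshape(1, 3, 227, 227)
    net.blobs['data'].data[...] = tf.preprocess('data', img)
    net.forward()
    return net.blobs['fc7'].data[0].copy()
```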
Improved Dense Trajectories. Dense trajectories are an effective descriptor for action recognition. Wang and Schmid [10] recently proposed additional steps to obtain Improved Dense Trajectories (IDT). Unlike dense trajectories, these features estimate and correct for camera motion and thus obtain trajectories primarily on the foreground moving objects (often human actors). As violence in videos is often characterized by rapid motion, we anticipate these features to work well for violence detection.

Several descriptors are computed for each trajectory – Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), Motion Boundary Histogram (MBH) and overall trajectory characteristics – to obtain a 426-dimensional representation for each trajectory. These features are projected via PCA to 213 dimensions and finally encoded using state-of-the-art Fisher vector encoding [8]. This results in a 109,056-dimensional feature representation for the entire video. Finally, as before, we train linear SVMs using these features.
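The 109,056 dimensions are consistent with the standard Fisher vector size of 2·K·D: with D = 213 PCA-reduced trajectory descriptors, 2 × 256 × 213 = 109,056, which suggests K = 256 Gaussian components (the paper does not state K, so this is an inference). The paper computes the encoding in Matlab; the Python sketch below illustrates the improved Fisher vector of [8] over a diagonal-covariance GMM.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descs, gmm):
    """Encode local descriptors (N x D) as a 2*K*D Fisher vector from the
    gradients w.r.t. the means and standard deviations of a diagonal GMM,
    with the power and L2 normalisation of the improved Fisher kernel [8]."""
    q = gmm.predict_proba(descs)                      # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    n = descs.shape[0]
    parts = []
    for k in range(len(w)):
        diff = (descs - mu[k]) / np.sqrt(var[k])      # whitened residuals (N, D)
        qk = q[:, k:k + 1]
        g_mu = (qk * diff).sum(axis=0) / (n * np.sqrt(w[k]))
        g_sd = (qk * (diff ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * w[k]))
        parts += [g_mu, g_sd]
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)          # L2 normalisation

# e.g. gmm = GaussianMixture(256, covariance_type='diag').fit(pca_trajectories)
#      fisher_vector(clip_trajectories, gmm).shape == (2 * 256 * 213,) == (109056,)
```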
2.3 Software environment
The descriptors and models were developed and trained in Python and Matlab. We used the scikit-learn machine learning framework [1] in Python, which uses the liblinear SVM library [3] as the backend. For extracting features from deep neural networks we used the Caffe framework [4], which provides a simple interface for classification and feature extraction with convolutional neural networks (CNNs). We extract IDT features using the provided code and implement Fisher vector encoding in Matlab.

3. EVALUATION
We now present and discuss the results obtained by the visual features. The best classifier parameters were obtained by cross-validation on the development-set splits.

The affect task, which includes valence and arousal, is treated as a multi-class classification problem (three classes each). The metric for these is the overall class prediction accuracy (acc). The violence task is a detection problem and uses average precision (ap) to evaluate the different methods.
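The sketch below shows how such a parameter search could reuse the movie-disjoint folds of Section 2.1 with scikit-learn, scoring the affect classifiers by accuracy and the violence detector by average precision; the data, the fold assignment and the C grid are stand-ins rather than values from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.metrics import average_precision_score

rng = np.random.RandomState(0)
X = rng.rand(500, 512)                     # stand-in frame descriptors
y_valence = rng.randint(0, 3, 500)         # 3-class affect labels
y_violence = rng.randint(0, 2, 500)        # binary violence labels
fold_id = rng.randint(0, 5, 500)           # development fold of each sample (Sec. 2.1)

cv = PredefinedSplit(test_fold=fold_id)    # reuse the movie-disjoint folds

# Affect tasks: choose C by cross-validated accuracy (the reported metric).
affect = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10]},
                      scoring='accuracy', cv=cv).fit(X, y_valence)

# Violence: a detection problem, scored by average precision on decision values.
violence = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10]},
                        scoring='average_precision', cv=cv).fit(X, y_violence)
print(affect.best_params_,
      average_precision_score(y_violence, violence.decision_function(X)))
```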
We present the results of the various run submissions in Table 1. The run submissions are as follows:
   • Run 1: GIST features + linear SVMs
   • Run 2: IDT features + linear SVMs
   • Run 3: CNN features + linear SVMs
   • Run 4: Fusion-1 + linear SVMs
   • Run 5: Fusion-2 + linear SVMs
Note that Runs 3 and 5 constitute external runs (Ext) since they use pre-trained CNN models. All other submissions are trained solely on the development data.

Table 1: Evaluation on the test set. The first three runs use a single feature, while the last two use late fusion.

        Ext.   Valence (acc)   Arousal (acc)   Violence (ap)
Run 1   -      35.5            30.8             7.1
Run 2   -      36.0            46.7             8.6
Run 3   X      38.5            44.7            10.2
Run 4   -      35.7            46.7            10.7
Run 5   X      38.5            51.9            12.9

We see that CNN features (Run 3) outperform the other two single-feature runs on valence and violence. Contrary to expectations, IDT features (Run 2) perform best on arousal classification. This can be explained by the fact that passive videos often contain very little motion, while active videos contain much more. While we expected IDT features to also perform well on violence, videos annotated as violent need not contain active motion and can often be shots of, for example, a post-crime scene. CNN features seem to work better in this case.

Fusion runs. Runs 4 and 5 constitute fusions of different features. Run 4, the Fusion-1 scheme, uses the features provided along with the dataset (IAV: image, audio and video features concatenated and trained as one model), GIST and IDT. Run 5, the Fusion-2 scheme, includes the above along with CNN features (thus making it an external data run).

In order to fuse the different features, we choose the best model for each feature type. We then perform late fusion, where the final score for each video is a weighted combination of the individual feature predictions. We try a grid of discrete weights to generate a large number of combinations and pick the best scoring model based on cross-validation.
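A minimal sketch of this weight search for the violence task is given below; the weight grid is illustrative, and for the affect tasks the same loop can be run on per-class scores with accuracy as the selection criterion.

```python
import itertools
import numpy as np
from sklearn.metrics import average_precision_score

def late_fusion_search(feature_scores, y_true, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """feature_scores: list of per-feature decision scores for the same clips,
    each of shape (N,). Every weight combination on the discrete grid is tried
    and the one with the best metric (here average precision) is kept."""
    best_w, best_score = None, -np.inf
    for w in itertools.product(grid, repeat=len(feature_scores)):
        if sum(w) == 0:
            continue                                   # skip the all-zero combination
        fused = sum(wi * s for wi, s in zip(w, feature_scores))
        score = average_precision_score(y_true, fused)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```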
Both fusion schemes perform equal to or better than the single features. For the Fusion-1 scheme, we see that IDT features get the highest weight, followed by the IAV (dataset) features. In the case of Fusion-2, CNN and IDT features are weighted higher.

Error analysis. We present a short analysis of the errors we encountered on the development set. For violent video detection, some of the difficult samples include black-and-white videos with rapid blinking. In the case of affect analysis, for both valence and arousal classification, cartoon scenes were often deemed colorful and classified as positive (or active) while their ground truth was neutral or negative (or passive).

4. CONCLUSION
We conclude that CNN features are the best single features for the Affective Impact of Movies task. Fine-tuning the model, or training a model to perform video classification as in [5], could further improve the performance. Fusing the models results in only a slight improvement, indicating that using other modalities such as meta-data and audio might help improve performance.

5. REFERENCES
[1] Python scikit-learn: machine learning framework. http://scikit-learn.org/.
[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In International Conference on Machine Learning (ICML), 2014.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale Video Classification with Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Neural Information Processing Systems (NIPS), 2012.
[7] A. Oliva and A. Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision (IJCV), 42(3):145–175, 2001.
[8] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. In European Conference on Computer Vision (ECCV), 2010.
[9] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, 2015.
[10] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV), 2013.