     RFA at MediaEval 2015 Affective Impact of Movies Task:
                   A Multimodal Approach

Ionuţ Mironică, University Politehnica of Bucharest, Romania, imironica@imag.pub.ro
Bogdan Ionescu, University Politehnica of Bucharest, Romania, bionescu@imag.pub.ro
Mats Sjöberg, Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland, mats.sjoberg@helsinki.fi
Markus Schedl, Johannes Kepler University, Linz, Austria, markus.schedl@jku.at
Marcin Skowron, Austrian Research Institute for Artificial Intelligence, Vienna, Austria, marcin.skowron@ofai.at


Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany


ABSTRACT
  The MediaEval 2015 Affective Impact of Movies Task challenged participants to automatically find violent scenes in a set of videos and also to predict the affective impact that video content will have on viewers. We propose the use of several multimodal descriptors, namely visual, motion and auditory features, and fuse their predictions to detect violent or affective content. Our best-performing runs with regard to the official metrics obtained a MAP of 0.1419 on the violence detection task, and accuracies of 45.038% for the arousal estimation and 36.123% for the valence estimation.

1.   INTRODUCTION
  The MediaEval 2015 Affective Impact of Movies Task [6] challenged participants to develop algorithms for finding violent scenes in movies. In addition, in contrast to previous years, the organizers introduced a completely new subtask for detecting the emotional impact of movies. The task provided a dataset of 10,900 short video clips extracted from 199 Creative Commons-licensed movies. A detailed description of the task, the dataset, the ground truth and the evaluation criteria is given in the overview paper by Sjöberg et al. [6].
  Our system this year is largely based on several multimodal systems that have already obtained good results on similar problems [3, 4, 5].

2.   METHOD
  Our system builds on a set of visual, motion and auditory features, combined with a Support Vector Machine (SVM) classifier, to obtain a violence or affect score for each video document. First, we perform feature extraction at the frame level. The resulting features are aggregated into one video descriptor using different strategies: the average of features, the Fisher kernel (FK) [4] or the Vector of Locally Aggregated Descriptors (VLAD) [3]. Finally, the global video descriptors are fed into an SVM multi-class classification framework. These steps are detailed in the following.

2.1    Feature set
  Visual: We extracted ColorSIFT features [8] using the opponent color space and spatial pyramids with two different sampling strategies: the Harris-Laplace salient point detector and dense sampling. We employed the Bag-of-Visual-Words (BoVW) approach, where each spatial pyramid partition is represented by a 1,000-dimensional histogram over its ColorSIFT features. We also computed the CENsus TRansform hISTogram (CENTRIST) descriptor proposed in [9]. In addition, we used features from a total of four Convolutional Neural Networks (CNNs), following the protocol laid out in [1]. The CNNs were trained on either the ImageNet 2010 or the 2012 training dataset, following as closely as possible the network structure and parameters of Krizhevsky et al. [2]. The input images were resized to 256×256 pixels either by distortion or by center cropping, giving in total four different CNNs from which we extract four different sets of feature vectors. We use the activations of the first fully-connected layer of each network as our features, which results in 4096-dimensional feature vectors. Ten regions were extracted from the test images as suggested in [2] (four corners, the center patch, plus their flipped versions), and a component-wise maximum is then taken over the region-wise features.
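To make the region-wise pooling concrete, the following is a minimal sketch, assuming 256×256 RGB frames stored as NumPy arrays, a 224×224 crop size (an assumption following [2]), and a hypothetical `cnn_fc1` callable that returns the 4096-dimensional first fully-connected-layer activations for one crop:

```python
import numpy as np

def ten_regions(img, s=224):
    """Return the ten regions suggested in [2]: four corner crops, the
    center crop, and the horizontal flips of all five. `img` is assumed to
    be a 256x256x3 frame; the 224x224 crop size is an assumption."""
    h, w, _ = img.shape
    crops = [img[:s, :s], img[:s, w - s:], img[h - s:, :s], img[h - s:, w - s:],
             img[(h - s) // 2:(h - s) // 2 + s, (w - s) // 2:(w - s) // 2 + s]]
    return crops + [np.fliplr(c) for c in crops]

def frame_descriptor(img, cnn_fc1):
    """Component-wise maximum of the first fully-connected-layer activations
    over the ten regions. `cnn_fc1` is a hypothetical callable mapping one
    crop to its 4096-dimensional activation vector."""
    activations = np.stack([cnn_fc1(crop) for crop in ten_regions(img)])  # (10, 4096)
    return activations.max(axis=0)                                        # (4096,)
```

Taking the component-wise maximum keeps, for each CNN unit, its strongest response over the ten views of the frame.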
  Auditory: As audio features, we used descriptors provided within the block-level framework [5], which have proven useful for retrieval, classification, and similarity tasks in the audio and music domain. More precisely, for the audio channel of each video we computed its spectral pattern (considers the cent-scaled spectrum on a 10-frame basis to characterize frequency and timbre), delta spectral pattern (the difference between the original spectrum and a copy of the spectrum delayed by 3 frames), variance delta spectral pattern (the variance between the delta spectral blocks), logarithmic fluctuation pattern (applies several psychoacoustic models and characterizes the amplitude modulations), correlation pattern (Pearson's correlation between all pairs of 52 cent-scaled frequency bands), and spectral contrast pattern (the difference between spectral peaks and valleys in 20 cent-scaled frequency bands). Each clip is thus characterized by a 9,448-dimensional feature vector that models its audio content.
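As an illustration of one of these descriptors, the sketch below computes a simplified correlation pattern, i.e., Pearson's correlation between all pairs of frequency bands; the cent-scaled band computation, block sampling and the summarization over blocks used by the actual block-level framework [5] are not reproduced here:

```python
import numpy as np

def correlation_pattern(band_energies):
    """Illustrative sketch of the correlation pattern: Pearson's correlation
    between all pairs of frequency bands. `band_energies` is a
    (bands, frames) array of band energies for one block; the real
    block-level framework [5] uses cent-scaled bands, block sampling and a
    summarization step over all blocks, which are omitted here."""
    corr = np.corrcoef(band_energies)            # (bands, bands) correlation matrix
    upper = np.triu_indices(corr.shape[0], k=1)  # indices of all distinct band pairs
    return corr[upper]                           # e.g. 52 bands -> 1,326 values
```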
                     Table 1: Results for the submitted runs.

                                                                   Violence task      Affect task
        Description                                                    MAP        Valence accuracy   Arousal accuracy
run 1   Average of audio descriptors & nonlinear SVM                  0.0485          33.032%            45.038%
run 2   Average of visual features & nonlinear SVM                    0.0452          36.123%            34.104%
run 3   Modified VLAD with motion features & linear SVM               0.0768          29.731%            39.865%
run 4   Fisher kernel with CNN visual features [2] & linear SVM       0.1419          30.320%            44.365%
run 5   Late fusion of all the previous runs                          0.0824          29.752%            37.595%


  Motion: We computed the Histogram of Oriented Gradients (3D-HoG) and Histogram of Optical Flow (3D-HoF) cuboid motion features [7]. Each feature is computed over 3D blocks with a dense sampling strategy: first, the gradient magnitude responses in the horizontal and vertical directions are computed. Then, each response is quantized into k orientations, where k = 8. Finally, these responses are aggregated over blocks of pixels in both the spatial and temporal directions and concatenated.
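A minimal sketch of the orientation quantization and block aggregation described above is given below; it covers only a single spatio-temporal block of grayscale frames, whereas the descriptors of [7] use a dense grid of such blocks and also optical-flow-based histograms:

```python
import numpy as np

def hog3d_block(frames, k=8):
    """Sketch of the dense 3D-HoG idea: per-pixel gradient orientations are
    quantized into k bins and magnitude-weighted histograms are accumulated
    over one spatio-temporal block. `frames` is a (T, H, W) grayscale array;
    the full descriptor of [7] additionally uses a dense grid of blocks and
    HoF computed on optical flow."""
    gy, gx = np.gradient(frames.astype(np.float32), axis=(1, 2))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bins = np.minimum((orientation / (2 * np.pi) * k).astype(int), k - 1)
    hist = np.array([magnitude[bins == b].sum() for b in range(k)])
    return hist / (np.linalg.norm(hist) + 1e-8)   # L2-normalized k-bin histogram
```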
                                                                    Table 1 details the results for all our runs. The third
2.2   Frame aggregation
  Results from the literature show that adopting the Fisher kernel [4] and VLAD [3] representations in many video classification tasks allows for higher accuracy than the use of traditional Bag-of-Words histogram representations, because these representations capture the temporal variation over the frames within a video. We therefore used two classical methods to encode the temporal variation of the frame-based features, the Fisher kernel [4] and a modified version of the Vector of Locally Aggregated Descriptors [3], and with them aggregated the frame features presented in Section 2.1.
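For reference, a sketch of standard VLAD aggregation over frame-level descriptors is shown below; the modified VLAD of [3] differs in details that are not reproduced here, and the codebook size in the usage comment is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_aggregate(frame_features, codebook):
    """Standard VLAD aggregation: sum the residuals to the nearest codeword
    per cluster, then apply power and L2 normalization. `frame_features` is
    an (n_frames, d) array and `codebook` is a fitted KMeans model."""
    assignments = codebook.predict(frame_features)        # nearest centroid per frame
    k, d = codebook.cluster_centers_.shape
    vlad = np.zeros((k, d))
    for c in range(k):
        residuals = frame_features[assignments == c] - codebook.cluster_centers_[c]
        if residuals.size:
            vlad[c] = residuals.sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))          # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-8)           # L2 normalization

# The codebook is learned offline on descriptors from training frames, e.g.:
# codebook = KMeans(n_clusters=64, n_init=10).fit(training_frame_descriptors)
```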
2.3   Classifier
  The final component of the system is the classifier, which is fed with the multimodal descriptors computed in the previous steps. Among the broad choice of existing classification approaches, we selected an SVM classifier. We tested several types of kernels, i.e., a fast linear kernel and two nonlinear kernels: RBF and Chi-Square. While linear SVMs are very fast in both training and testing, SVMs with nonlinear kernels are more accurate in many classification tasks due to their better adaptation to the shape of the clusters in the feature space.
  Finally, in the case of multimodal features, we combine the SVMs' output confidence values using the max late-fusion rule:

                 CombMax(d, q) = max_{i=1..N} cv_i                 (1)

where cv_i is the confidence value of classifier i for class q (q ∈ {1, ..., C}), C represents the number of classes, d is the current video, and N is the number of classifiers to be aggregated.
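A minimal sketch of this fusion rule, assuming the per-classifier confidence values for one video are stored as an N×C array of comparable (e.g., probability-calibrated) SVM scores:

```python
import numpy as np

def max_late_fusion(confidences):
    """Late fusion as in Eq. (1): for each class, keep the maximum confidence
    value over the N classifiers. `confidences` is an (N, C) array of
    per-classifier, per-class scores for one video."""
    return np.asarray(confidences).max(axis=0)   # (C,) fused class scores

# Example: three classifiers (e.g., audio, visual, motion) over C = 2 classes.
scores = [[0.2, 0.8],
          [0.6, 0.4],
          [0.3, 0.7]]
fused = max_late_fusion(scores)          # -> [0.6, 0.8]
predicted_class = int(np.argmax(fused))  # -> 1
```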
3.    EXPERIMENTAL RESULTS

3.1   Submitted runs
  We submitted five runs for both tasks, i.e., the violence detection task and the induced affect detection task. For the first run we combined the audio features with a nonlinear SVM classifier. For the second run, we combined several visual features (BoVW-ColorSIFT, CENTRIST histograms and CNN features) with a nonlinear SVM classifier. The third run uses the modified VLAD representation of the 3D-HoG/3D-HoF motion features with nonlinear SVM classifiers. In the fourth run, we aggregate the CNN frame features with the Fisher kernel representation and use a linear SVM classifier. Finally, for the fifth run we performed a late fusion of the first four runs.

3.2   Results and discussion
  Table 1 details the results for all our runs. The third column presents the MAP results obtained on the violence task, while the next two columns provide the final accuracies on the second task: the valence and arousal predictions.
  Audio features and standard visual features performed poorly on the violence task. In contrast, the combination of VLAD with the motion features obtained better results, and the best results were obtained using the Fisher kernel with CNN visual features. Fusing all the features together did not improve the results above the FK-CNN-only result. On the induced affect detection task, all combinations perform similarly, except for the audio features, which obtained a clearly better arousal accuracy.

4.    CONCLUSIONS
  In this paper, we presented several multimodal approaches for the detection of violent and affective content in movies. We obtained the best results on the violence task by using motion and visual features, whereas on the affect task we obtained the best results using the audio features only. The visual and motion features obtained lower results for both valence and arousal prediction. One reason for this is that the visual features are not well suited to the purpose of the affect task; it also indicates that the affect task is more challenging than the violence task.

Acknowledgements
We received support from the Austrian Science Fund (FWF): P25655 and the InnoRESEARCH POSDRU /159/1.5/S/132395 program.

5.    REFERENCES
[1] M. Koskela and J. Laaksonen. Convolutional network features for scene recognition. In Proceedings of the 22nd International Conference on Multimedia, Orlando, Florida, November 2014.
[2] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NIPS), 2012.
[3] I. Mironică, I. Duţă, B. Ionescu, and N. Sebe. A Modified Vector of Locally Aggregated Descriptors
    Approach for Fast Video Classification. Multimedia
    Tools and Applications (MTAP), 2015.
[4] I. Mironică, J. Uijlings, N. Rostamzadeh, B. Ionescu,
    and N. Sebe. Time Matters! Capturing Variation in
    Time in Video using Fisher Kernels. In ACM
    Multimedia, Barcelona, Spain, 21-25 October 2013.
[5] K. Seyerlehner, G. Widmer, M. Schedl, and P. Knees.
    Automatic Music Tag Classification based on
    Block-Level Features. In Proceedings of the 7th Sound
    and Music Computing Conference (SMC 2010),
    Barcelona, Spain, July 2010.
[6] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang,
    B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty,
    and L. Chen. The MediaEval 2015 Affective Impact of
    Movies Task. In MediaEval 2015 Workshop, Wurzen,
    Germany, September 14-15 2015.
[7] J. Uijlings, I. Duta, E. Sangineto, and N. Sebe. Video
    classification with densely extracted hog/hof/mbh
    features: an evaluation of the accuracy/computational
    efficiency trade-off. International Journal of Multimedia
    Information Retrieval, pages 1–12, 2014.
[8] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek.
    Evaluating color descriptors for object and scene
    recognition. IEEE Transactions on Pattern Analysis
    and Machine Intelligence (PAMI), 32(9):1582–1596,
    2010.
[9] J. Wu and J. M. Rehg. CENTRIST: A visual descriptor
    for scene categorization. IEEE Transactions on Pattern
    Analysis and Machine Intelligence (PAMI),
    33(8):1489–1501, 2011.