          FAR at MediaEval 2014 Violent Scenes Detection:
                 A Concept-based Fusion Approach

Mats Sjöberg
University of Helsinki, Finland
mats.sjoberg@helsinki.fi

Ionuţ Mironică
University Politehnica of Bucharest, Romania
imironica@imag.pub.ro

Markus Schedl
Johannes Kepler University, Linz, Austria
markus.schedl@jku.at

Bogdan Ionescu
University Politehnica of Bucharest, Romania
bionescu@imag.pub.ro

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain


ABSTRACT
The MediaEval 2014 Violent Scenes Detection task challenged participants to automatically find violent scenes in a set of videos. We propose to first predict a set of mid-level concepts from low-level visual and auditory features, and then fuse the concept predictions and features to detect violent content. With the objective of obtaining a highly generic approach, we deliberately restrict ourselves to simple general-purpose descriptors with limited temporal context and a common neural network classifier. The system used this year is largely based on the one successfully employed by our group in 2012 and 2013, with some improvements and updated features. Our best-performing run with regard to the official metric received a MAP2014 of 45.06% in the main task and 66.38% in the generalization task.

1. INTRODUCTION
The MediaEval 2014 Violent Scenes Detection task [4] challenged participants to develop algorithms for finding violent scenes in two settings: popular Hollywood-style movies (main task) and YouTube web videos (generalization task). The organizers provided a training set of 24 movies with frame-wise annotations of segments containing physical violence, as well as several violence-related concepts (e.g. blood or fire) for part of the data. The test set consisted of 7 movies for the main task and 86 short web videos for the generalization task.
Our system this year is largely based on the one successfully employed by us in 2012 [3] and 2013 [5]. We tackle the task as a machine learning problem, employing general-purpose features and a neural network classifier. The main novel contribution is an updated set of low-level features.

2. METHOD
Our system builds on a set of visual and auditory features, employing the same type of neural network classifier at different stages to obtain a violence score for each frame of an input video. First, we perform feature extraction at the frame level. The resulting data is then fed into a multi-classifier framework that operates in two steps. The first step consists of training the system using ground truth data. Training is performed at two levels. At mid-level, a bank of classifiers is trained using ground truth for concepts that are usually present in violent scenes, e.g., the presence of “fire”, the presence of “gunshots”, or “gory” scenes. High-level violence detection is then performed by a final classifier that is fed with the concept predictions, the low-level content descriptors, or both. The violence classifier is also trained on the provided ground truth for the violent segments. The final step consists of classifying new unlabeled data (e.g. the test set), which is achieved by employing the previously trained multi-classifier framework. These steps are detailed in the following.
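
As an illustration of the data flow only, and not of the exact implementation used, the two-level training can be sketched as follows; `train_mlp` and the per-frame label arrays are hypothetical placeholders.

```python
import numpy as np

# Hypothetical stand-ins for the actual data and classifier:
# `train_mlp(X, y)` is assumed to return an object with predict_proba(X).
def train_pipeline(frame_features, concept_labels, violence_labels, train_mlp):
    """Two-level training: mid-level concept classifiers, then a violence classifier."""
    # Mid-level: one classifier per violence-related concept (fire, gunshots, gore, ...).
    concept_models = {
        name: train_mlp(frame_features, labels)
        for name, labels in concept_labels.items()
    }

    # Concept predictions become extra inputs for the high-level classifier.
    concept_scores = np.column_stack(
        [m.predict_proba(frame_features) for m in concept_models.values()]
    )
    fused_input = np.hstack([frame_features, concept_scores])

    # High-level: final violence classifier trained on the violence ground truth.
    violence_model = train_mlp(fused_input, violence_labels)
    return concept_models, violence_model
```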

2.1 Feature set
Visual (225 dimensions): For each video frame, we extract several standard color- and texture-based descriptors, such as the Color Naming Histogram, Color Moments, Local Binary Patterns, the Color Structure Descriptor, and the Gray Level Run Length Matrix. We also compute the Histogram of Oriented Gradients, which exploits the local object appearance and shape within a frame by using the distribution of edge orientations. For a more detailed description of the visual features, see [1].

Auditory (29 dimensions): In addition, we extract a set of low-level auditory features: amplitude envelope, root-mean-square energy, zero-crossing rate, band energy ratio, spectral centroid, spectral flux, bandwidth, and Mel-frequency cepstral coefficients. We compute the features on frames of 40 ms without overlap to make alignment with the 25-fps video frames trivial.
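
A minimal sketch of this kind of frame-level extraction, using OpenCV and scikit-image as stand-in libraries and reproducing only two of the descriptors named above (color moments and an LBP histogram); the actual 225-dimensional descriptor set is described in [1].

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def visual_descriptor(frame_bgr):
    """Toy per-frame descriptor: color moments plus a uniform-LBP histogram."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    moments = np.concatenate([hsv.mean(axis=(0, 1)), hsv.std(axis=(0, 1))])  # 6 dims
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)    # 10 dims
    return np.concatenate([moments, lbp_hist])

# With 25-fps video, one 40 ms audio frame covers exactly one video frame,
# so audio frame i can simply be paired with video frame i.
def audio_frames(samples, sample_rate=44100, frame_ms=40):
    hop = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + hop] for i in range(0, len(samples) - hop + 1, hop)]
```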

2.2 Classifier
For classification, we use multi-layer perceptrons with a single hidden layer of 512 units and one or multiple output units. All units use the logistic sigmoid transfer function. The input data is normalized by subtracting the mean and dividing by the standard deviation of each input dimension. Training is performed by backpropagating cross-entropy error, using random dropouts to improve generalization. We follow the dropout scheme of [2, Sec. A.1] with some minor modifications to the parameters.
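
A rough equivalent of such a network in Keras, shown only for illustration; the 0.5 dropout rate, the optimizer, and the use of a standard toolkit are assumptions rather than the exact setup used.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(input_dim, n_outputs=1, dropout_rate=0.5):
    """One hidden layer of 512 logistic-sigmoid units, sigmoid outputs,
    cross-entropy loss and dropout. Inputs are assumed to be standardized
    (zero mean, unit variance per dimension) beforehand."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dropout(dropout_rate),            # dropout on the inputs
        layers.Dense(512, activation="sigmoid"), # single hidden layer
        layers.Dropout(dropout_rate),            # dropout on the hidden units
        layers.Dense(n_outputs, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    return model
```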

For the concept training set of 18 movies, each video frame was annotated with the 10 different concepts as detailed in [4]. We divide the concepts into visual, auditory and audiovisual categories, depending on which low-level feature domains we think are relevant for each. Next, we train and evaluate a neural network for each of the concepts, employing leave-one-movie-out cross-validation.
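
The cross-validation loop can be sketched as follows, with `train_fn` as a hypothetical training helper; the out-of-fold predictions it returns are reused in Section 2.3.

```python
import numpy as np

def leave_one_movie_out(frame_movie_ids, features, labels, train_fn):
    """Leave-one-movie-out cross-validation: for each movie, train a concept
    classifier on all other movies and predict on the held-out one.
    Returns per-frame out-of-fold concept predictions."""
    predictions = np.zeros(len(labels))
    for movie in np.unique(frame_movie_ids):
        test_mask = frame_movie_ids == movie
        model = train_fn(features[~test_mask], labels[~test_mask])
        predictions[test_mask] = model.predict_proba(features[test_mask])
    return predictions
```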

2.3 Fusion scheme
The final violence predictor is trained using both low-level features and all mid-level concept predictions as inputs. For comparison, we also train classifiers to predict violence just from the features or just from the concepts.

Training the violence detector requires inputs that are similar to those that will be used in the testing phase, so using the concept ground truth for training will not work. Instead, we use the concept prediction cross-validation outputs on the training set (see previous section) as a more realistic input source; in this way the system can learn which concept predictors to rely on.
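
A minimal sketch of this early-fusion step, assuming the low-level features and the per-concept cross-validated predictions are already available as arrays.

```python
import numpy as np

def build_violence_training_set(features, concept_cv_predictions):
    """Early fusion of low-level features with cross-validated concept
    predictions (one column per concept). Using out-of-fold predictions
    rather than the concept ground truth gives the violence classifier
    inputs that resemble what it will see at test time."""
    concept_matrix = np.column_stack(list(concept_cv_predictions.values()))
    return np.hstack([features, concept_matrix])
```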

The final violence prediction score is generated by applying a sliding median filter as temporal smoothing. We used a filter length of 5 seconds (125 frames), selected by experimenting on the training set. The final detection as violent or non-violent is generated by thresholding the prediction score. The thresholds were determined by maximizing the MAP2014 performance measure on the training set using cross-validation.
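
For illustration, the smoothing and thresholding step could look as follows, here using SciPy's median filter; the exact implementation may differ.

```python
import numpy as np
from scipy.signal import medfilt

def detect_violence(frame_scores, threshold, filter_len=125):
    """Temporal smoothing with a sliding median filter (125 frames = 5 s at
    25 fps), followed by thresholding into a binary violent / non-violent
    decision per frame. The threshold itself is tuned on the training set."""
    smoothed = medfilt(frame_scores, kernel_size=filter_len)
    return smoothed >= threshold
```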

3. EXPERIMENTAL RESULTS
We submitted five runs for both the main task and the generalization task. Table 1 details the results for all our runs: the first five lines show our runs submitted to the main task, the next five lines those for the generalization task. The second column indicates which input features were used, 'a' for auditory, 'v' for visual, and 'c' for concept predictions. Multiple feature modalities indicate that they were integrated using early fusion. For the main task, the auditory features achieved the highest MAP2014 result. Concept detectors and visual features performed poorly in the main task, and fusing them with the auditory features did not improve the results above the audio-only result. In contrast, in the generalization task all combinations perform similarly, except for the concepts, which give a clearly better result.

Table 1: Results for different features (%)
run      feat.   prec.   recall   F-score   MAP2014
main 1   a       28.04   71.26    40.24     45.06
main 2   v       17.88   93.62    30.03     32.64
main 3   c       28.65   44.94    34.99     25.02
main 4   av      19.34   77.18    30.92     31.96
main 5   ac      29.16   63.08    39.88     40.77
gen 1    a       46.04   85.81    59.93     57.81
gen 2    v       43.42   86.05    57.72     59.63
gen 3    c       49.68   85.80    62.92     66.38
gen 4    av      44.76   83.38    58.25     58.07
gen 5    ac      46.86   83.94    60.14     60.92

Another observation is that all results have a strong imbalance between precision and recall. Our analysis indicates that this is not due to a poor selection of the violence judgment threshold (in fact our thresholds are relatively close to optimal), but rather due to the new MAP2014 measure favoring high recall.

Table 2 shows the movie-specific results for each of our main task runs. Interestingly, the auditory features perform particularly well on the anime movie “Ghost in the Shell”, while the visual features perform strongly on “8 Mile”, a drama movie with more realistic violence such as fist fights. “Jumanji” and “Braveheart” are the two movies with the poorest results. This can perhaps be explained by the fact that they differ from the training set more than the other movies do. In particular, “Braveheart” depicts brutal medieval fights, which are not represented in the training set.

Table 2: Movie specific results, MAP2014 (%)
movie (main task)      a       v       c       av      ac
Ghost in the Shell     82.67   20.38   25.26   23.72   67.30
Braveheart             29.01   36.26   17.22   22.65   24.79
Jumanji                29.27   16.13    2.71   14.07   23.70
Desperado              37.78   42.58   18.25   34.85   27.65
V for Vendetta         48.48   24.98   36.80   45.07   49.10
Terminator 2           56.17   27.27   48.82   43.25   55.26
8 Mile                 32.03   60.84   26.09   40.08   37.61
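
For reference, the precision, recall, and F-score columns in Table 1 correspond to standard frame-level measures of the kind computed in the sketch below; the MAP2014 measure itself is defined by the task organizers in [4] and is not reproduced here.

```python
import numpy as np

def frame_level_prf(predicted, ground_truth):
    """Frame-level precision, recall and F-score from boolean per-frame arrays
    of predicted and annotated violent frames."""
    tp = np.sum(predicted & ground_truth)
    precision = tp / max(np.sum(predicted), 1)
    recall = tp / max(np.sum(ground_truth), 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f_score
```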

4. CONCLUSIONS
Our results show that violence detection can be done well using general-purpose features and generic neural network classifiers, without engineering domain-specific features. The selection of feature modalities is highly dependent on the type of material: for Hollywood-style movies the auditory features performed best, while concepts are useful for the more mixed styles found in YouTube videos. Based on the results, we can also conclude that our violence detection framework generalises well: even though it was trained only on feature-length movies, it performs accurate violence detection on YouTube videos as well.

Acknowledgements
We received support from the Academy of Finland, grants no. 255745 and 251170, the ESF POSDRU/159/1.5/S/132395 InnoRESEARCH programme, the EU-FP7 project no. 601166, and the Austrian Science Fund (FWF): P22856 and P25655.

5. REFERENCES
[1] B. Boteanu, I. Mironică, and B. Ionescu. A relevance feedback perspective to image search result diversification. In Proc. ICCP, 2014.
[2] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
[3] B. Ionescu, J. Schlüter, I. Mironică, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In Proc. ICMR, pages 215–222, New York, NY, USA, 2013. ACM.
[4] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, and C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[5] M. Sjöberg, J. Schlüter, B. Ionescu, and M. Schedl. FAR at MediaEval 2013 violent scenes detection: Concept-based violent scenes detection in movies. In Proc. MediaEval Workshop, Barcelona, Spain, 2013.