FAR at MediaEval 2013 Violent Scenes Detection:
Concept-based Violent Scenes Detection in Movies

Mats Sjöberg, Aalto University, Espoo, Finland (mats.sjoberg@aalto.fi)
Jan Schlüter, Austrian Research Institute for Artificial Intelligence, Vienna, Austria (jan.schlueter@ofai.at)
Bogdan Ionescu, University Politehnica of Bucharest, Romania (bionescu@imag.pub.ro)
Markus Schedl, Johannes Kepler University, Linz, Austria (markus.schedl@jku.at)


ABSTRACT
The MediaEval 2013 Affect Task challenged participants to automatically find violent scenes in a set of popular movies. We propose to first predict a set of mid-level concepts from low-level visual and auditory features, then fuse the concept predictions and features to detect violent content. We deliberately restrict ourselves to simple general-purpose features with limited temporal context and a generic neural network classifier. The system used this year is largely based on the one successfully employed by our group in the 2012 task, with some improvements based on our experience from last year. Our best-performing run with regard to the official metric received a MAP@100 of 49.6%.

Keywords
Violent Scenes Detection, Concept Detection, Supervised Learning, Neural Networks, MediaEval 2013

1. INTRODUCTION
The MediaEval 2013 Affect Task [1] challenged participants to develop algorithms for finding violent scenes in popular movies from DVD content based on video, audio and subtitles. The organizers provided a training set of 18 movies with frame-wise annotations of segments containing physical violence as well as several violence-related concepts (e.g. blood or fire), and a test set of 7 unannotated movies.

The system used by our group this year is largely based on the one successfully employed by us in the 2012 edition of the violent scenes detection task [4]. This year we have tried new descriptor combinations and tweaked the neural network training parameters based on experiments performed with the 2012 task setup.

2. METHOD
Our system builds on a set of visual and auditory features, employing the same type of classifier at different stages to obtain a violence score for each frame of an input video. The setup is largely the same as in 2012 [4].

2.1 Feature set
Visual (93 dimensions): For each video frame, we extract an 81-dimensional Histogram of Oriented Gradients (HoG), an 11-dimensional Color Naming Histogram [6] and a visual activity value. The latter is obtained by lowering the threshold of the cut detector in [3] such that it becomes overly sensitive, then counting the number of detections in a 2-second time window centered on the current frame.

Auditory (98 dimensions): In addition, we extract a set of low-level auditory features as used by [5]: Linear Predictive Coefficients (LPCs), Line Spectral Pairs (LSPs), Mel-Frequency Cepstral Coefficients (MFCCs), Zero-Crossing Rate (ZCR), and spectral centroid, flux, rolloff, and kurtosis, augmented with the variance of each feature over a half-second time window. We use frame sizes of 40 ms without overlap to make alignment with the 25-fps video frames trivial.
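
The sketch below illustrates how per-frame descriptors of this kind could be assembled with off-the-shelf libraries (OpenCV, scikit-image, librosa). It is not the authors' implementation: the HoG layout (9 orientations over a 3x3 grid of cells, giving 81 dimensions), the hypothetical color_name_index helper, and the omission of the LPC/LSP and variance terms on the audio side are our own assumptions.

    # Illustrative sketch only; parameter choices are assumptions, not the
    # implementation used in the paper.
    import cv2
    import numpy as np
    import librosa
    from skimage.feature import hog

    def visual_descriptor(frame_bgr, activity_count):
        """frame_bgr: HxWx3 uint8 video frame; activity_count: number of cut
        detections from an over-sensitive cut detector in a 2-second window
        centered on this frame."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        h, w = gray.shape
        # 81-dim HoG: 9 orientations over a 3x3 grid of cells (assumed layout)
        hog81 = hog(gray, orientations=9, pixels_per_cell=(h // 3, w // 3),
                    cells_per_block=(1, 1), feature_vector=True)
        # 11-dim color naming histogram [6]; color_name_index is a hypothetical
        # helper mapping each pixel to one of the 11 basic color names
        names = color_name_index(frame_bgr)
        cn11 = np.bincount(names.ravel(), minlength=11).astype(float)
        cn11 /= max(cn11.sum(), 1.0)
        return np.concatenate([hog81, cn11, [activity_count]])   # 93 dims

    def auditory_descriptor(y, sr):
        """y: mono audio signal; 40 ms non-overlapping frames as in Sec. 2.1.
        Only a subset of the listed features is shown (LPCs, LSPs and the
        half-second variance terms are omitted for brevity)."""
        n = int(0.040 * sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n, hop_length=n)
        zcr = librosa.feature.zero_crossing_rate(y, frame_length=n, hop_length=n)
        cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n, hop_length=n)
        roll = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n, hop_length=n)
        return np.vstack([mfcc, zcr, cent, roll]).T   # one row per 40 ms frame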

2.2 Classifier
For classification, we use multi-layer perceptrons with a single hidden layer of 512 units and one or multiple output units. All units use the logistic sigmoid transfer function.

We normalize the input data by subtracting the mean and dividing by the standard deviation of each input dimension.

Training is performed by backpropagating cross-entropy error, using random dropouts to improve generalization. We follow the dropout scheme of [2, Sec. A.1] with minor modifications: all weights are initialized to zero, mini-batches are 900 samples, the learning rate starts at 5.0, momentum is increased from 0.45 to 0.9 between epochs 10 and 20, and we train for 100 epochs only. These settings worked well in experiments with the 2012 training/testing split. In particular, we increased the learning rate from what was used in 2012, because it improved performance.
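
As an illustration, a comparable network could be set up as follows, e.g. in PyTorch. This is a simplified sketch under our own assumptions (a plain SGD loop, dropout rates of 0.2/0.5); the exact recipe of Sec. 2.2 (zero weight initialization, learning rate 5.0, mini-batches of 900 samples) is only indicated in the comments, since it is tied to the particular training scheme of [2].

    # Minimal sketch (not the authors' code): one 512-unit sigmoid hidden layer,
    # sigmoid outputs, cross-entropy loss, dropout on inputs and hidden units.
    import torch
    import torch.nn as nn

    class ConceptMLP(nn.Module):
        def __init__(self, n_inputs, n_outputs=1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Dropout(0.2),            # input dropout (rate is an assumption)
                nn.Linear(n_inputs, 512),
                nn.Sigmoid(),
                nn.Dropout(0.5),            # hidden dropout (rate is an assumption)
                nn.Linear(512, n_outputs),
                nn.Sigmoid(),
            )

        def forward(self, x):
            return self.net(x)

    def train(model, loader, epochs=100, lr=5.0):
        # lr=5.0 and 100 epochs follow Sec. 2.2; they assume inputs normalized
        # to zero mean / unit variance and the dropout scheme of [2]
        loss_fn = nn.BCELoss()              # cross-entropy for sigmoid outputs
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.45)
        for epoch in range(epochs):
            # ramp momentum from 0.45 to 0.9 between epochs 10 and 20 (Sec. 2.2)
            m = 0.45 + 0.45 * min(max(epoch - 10, 0), 10) / 10.0
            for group in opt.param_groups:
                group['momentum'] = m
            for x, y in loader:             # mini-batches of 900 samples
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()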

2.3 Fusion scheme
As last year [4], we use the concept annotations as a stepping stone for predicting violence: we train a separate classifier for each of 10 different concepts on the visual, auditory or both feature sets, then train the final violence predictor using both feature sets and all concept predictions as inputs. For comparison, we also train classifiers to predict violence just from the features or just from the concepts.
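
A possible arrangement of this two-stage scheme is sketched below, with scikit-learn MLPs standing in for the networks of Sec. 2.2. The concept names and their assignment to modalities are illustrative placeholders, not the assignment used in the paper; cv_concept_scores refers to the cross-validated concept outputs described in Sec. 3.2.

    # Sketch of the two-stage fusion (assumed structure, not the authors' code).
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Illustrative concept-to-modality assignment ('v', 'a' or 'av')
    CONCEPT_MODALITY = {'blood': 'v', 'fire': 'av', 'firearms': 'v',
                        'carchase': 'av', 'screams': 'a', 'gunshots': 'a'}

    def make_mlp():
        # single 512-unit logistic hidden layer, roughly matching Sec. 2.2
        return MLPClassifier(hidden_layer_sizes=(512,), activation='logistic',
                             max_iter=100)

    def stack(X_v, X_a, modality):
        return np.hstack({'v': [X_v], 'a': [X_a], 'av': [X_v, X_a]}[modality])

    def train_fusion(X_v, X_a, concept_labels, cv_concept_scores, violence_labels):
        """X_v, X_a: frame-wise visual/auditory features; concept_labels: dict
        name -> binary frame labels; cv_concept_scores: (n_frames, n_concepts)
        cross-validated concept outputs on the training set (see Sec. 3.2)."""
        concept_models = {name: make_mlp().fit(stack(X_v, X_a, mod),
                                               concept_labels[name])
                          for name, mod in CONCEPT_MODALITY.items()}
        X_final = np.hstack([X_v, X_a, cv_concept_scores])
        violence_model = make_mlp().fit(X_final, violence_labels)
        return concept_models, violence_model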

3. EXPERIMENTAL RESULTS

3.1 Concept prediction
For the training set of 18 movies, each video frame was annotated with the 10 different concepts as detailed in [1]. We divide the concepts into visual, auditory and audiovisual categories, depending on which low-level feature domains we think are relevant for each. Next, we train and evaluate a neural network for each of the concepts, employing leave-one-movie-out cross-validation. The evaluation results are very similar to our experiments in 2012 [4, Sec. 3.1], which is not surprising since the training set has only been supplemented with 20% new movies. For example, firearms and fire perform well, while carchase performs badly.
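
As a concrete illustration of this step, the sketch below (our own, using scikit-learn's LeaveOneGroupOut; all names are hypothetical) produces one out-of-fold prediction per frame, so that each frame is scored by a concept model that never saw its movie. These are also the predictions that are reused as inputs in Sec. 3.2.

    # Sketch of leave-one-movie-out cross-validation for one concept classifier.
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    def cross_validated_scores(X, y, movie_ids, make_model):
        """X: (n_frames, d) features; y: (n_frames,) binary concept labels;
        movie_ids: (n_frames,) id of the movie each frame belongs to;
        make_model: factory returning a fresh classifier, e.g. the make_mlp()
        of the Sec. 2.3 sketch."""
        scores = np.zeros(len(y), dtype=float)
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=movie_ids):
            model = make_model().fit(X[train_idx], y[train_idx])
            scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
        return scores   # out-of-fold concept predictions, one per frame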

3.2 Violence prediction
Next, we train a frame-wise violence predictor, using visual and auditory low-level features, as well as the concept predictions, as input. Training requires inputs that are similar to those that will be used in the testing phase, thus using the concept ground-truth for training will not work. Instead we use the concept prediction cross-validation outputs on the training set (see previous section) as a more realistic input source – in this way the system can learn which concept predictors to rely on.
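
At test time, the corresponding step is to run the trained concept models on an unseen movie and feed their outputs, together with the low-level features, to the violence model. The short sketch below continues the hypothetical train_fusion()/stack() example from Sec. 2.3.

    # Test-time counterpart of the fusion sketch in Sec. 2.3 (assumed names).
    def predict_violence(X_v, X_a, concept_models, violence_model):
        concept_scores = np.column_stack(
            [concept_models[name].predict_proba(stack(X_v, X_a, mod))[:, 1]
             for name, mod in CONCEPT_MODALITY.items()])
        X_final = np.hstack([X_v, X_a, concept_scores])
        return violence_model.predict_proba(X_final)[:, 1]   # frame-wise scores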

3.3 Evaluation results
We submitted five runs for sub-task 1, i.e., the objective violence definition. Due to time constraints we were not able to prepare any runs for sub-task 2, which used the subjective violence definition. One of our runs was a segment-level run (run5), which forms segments of consecutive frames that our predictor tagged as violent or non-violent. The remaining four runs are shot-level (run1 to run4) and use the shot boundaries provided by the task organizers. For each run, each partition (segment or shot) is assigned a violence score corresponding to the highest predictor output for any frame within that partition. The partitions are then tagged as violent or non-violent depending on whether their violence score exceeds a certain threshold. We used the same thresholds as our system in 2012, which were determined by cross-validation on the training set of that year.
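
This decision rule amounts to max-pooling the frame-wise scores over each shot (or segment) and thresholding, as in the short sketch below (our own illustration; names and the threshold value are placeholders).

    # Sketch of the shot-level decision rule described above.
    import numpy as np

    def score_shots(frame_scores, shot_bounds, threshold):
        """frame_scores: (n_frames,) frame-wise violence scores;
        shot_bounds: list of (start, end) frame indices per shot, end exclusive."""
        results = []
        for start, end in shot_bounds:
            score = float(np.max(frame_scores[start:end]))
            results.append((start, end, score, score > threshold))
        return results   # (start, end, violence score, tagged violent?) per shot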

Table 1 details the results for all our runs. The first five lines show our runs submitted to the official evaluation. The first four are shot-level runs, the fifth our single segment-level run. The next three lines are additional unofficial runs that we evaluated ourselves. The second column indicates which input features were used: 'a' for auditory, 'v' for visual, and 'c' for concept predictions. The auditory features achieved the highest MAP@100 result, with no gains being provided by the other modalities.

Table 1: Results for different features
           feat.   prec.   rec.   max F-sc.   MAP@100
  run1     a       34%     48%    40.0%       49.6%
  run2     c       35%     51%    41.5%       30.4%
  run3     av      34%     52%    41.4%       39.6%
  run4     avc     35%     48%    40.9%       40.4%
  run5     avc     23%     28%    25.8%       35.0%
           v       20%     50%    29.0%       23.9%
           ac      37%     47%    41.6%       47.4%
           vc      22%     53%    31.0%       28.5%

For our submissions we reused the thresholds from [4]. Unfortunately, this gave a very imbalanced precision and recall for the concept-only submission (run2), making it difficult to compare to our other runs. To better judge the relative performance of our submissions, Table 1 reports precision, recall and F-score for the threshold maximizing the F-score. Under this metric, the combination of auditory features and concept predictions gives the best result, but differences between most runs are quite small.

Table 2 shows the movie-specific results for each of our submitted shot-level runs. Despite the bad threshold on run2, it performs very well on Pulp Fiction. The movie "Legally Blond" had very few violent scenes and these were hard to detect with any of our runs.

Table 2: Movie-specific results (MAP@100)
  movie              run1     run2     run3     run4
  Fantastic Four 1   73.1%    63.0%    60.5%    69.7%
  Fargo              55.5%     0.0%    57.0%    60.6%
  Forrest Gump       38.9%    19.3%    35.8%    37.0%
  Legally Blond       0.0%     0.0%     4.3%     4.4%
  Pulp Fiction       62.0%    90.9%    51.4%    52.1%
  The God Father 1   84.7%    39.5%    49.3%    47.7%
  The Pianist        32.9%     0.0%    19.3%    11.2%

4. CONCLUSIONS
Our results show that violence detection can be done fairly well using general-purpose features and generic neural network classifiers, without engineering domain-specific features. While auditory features give the best results, using mid-level concepts can give small overall gains, and more pronounced gains for particular movies.

5. REFERENCES
[1] C. Demarty, C. Penet, M. Schedl, B. Ionescu, V. Quang, and Y. Jiang. The MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[2] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
[3] B. Ionescu, V. Buzuloiu, P. Lambert, and D. Coquin. Improved Cut Detection for the Segmentation of Animation Movies. In IEEE ICASSP, France, 2006.
[4] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR '13), pages 215-222, New York, NY, USA, 2013. ACM.
[5] C. Liu, L. Xie, and H. Meng. Classification of music and speech in Mandarin news broadcasts. In Proc. of the 9th Nat. Conf. on Man-Machine Speech Communication (NCMMSC), Huangshan, Anhui, China, 2007.
[6] J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Trans. on Image Processing, 18(7):1512-1523, 2009.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain