=Paper=
{{Paper
|id=None
|storemode=property
|title=FAR at MediaEval 2013 Violent Scenes Detection: Concept-based Violent Scenes Detection in Movies
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_10.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SjobergSIS13
}}
==FAR at MediaEval 2013 Violent Scenes Detection: Concept-based Violent Scenes Detection in Movies==
Mats Sjöberg, Aalto University, Espoo, Finland (mats.sjoberg@aalto.fi)
Jan Schlüter, Austrian Research Institute for Artificial Intelligence, Vienna, Austria (jan.schlueter@ofai.at)
Bogdan Ionescu, University Politehnica of Bucharest, Romania (bionescu@imag.pub.ro)
Markus Schedl, Johannes Kepler University, Linz, Austria (markus.schedl@jku.at)
ABSTRACT
The MediaEval 2013 Affect Task challenged participants to automatically find violent scenes in a set of popular movies. We propose to first predict a set of mid-level concepts from low-level visual and auditory features, then fuse the concept predictions and features to detect violent content. We deliberately restrict ourselves to simple general-purpose features with limited temporal context and a generic neural network classifier. The system used this year is largely based on the one successfully employed by our group in the 2012 task, with some improvements based on our experience from last year. Our best-performing run with regard to the official metric received a MAP@100 of 49.6%.

Keywords
Violent Scenes Detection, Concept Detection, Supervised Learning, Neural Networks, MediaEval 2013
1. INTRODUCTION
The MediaEval 2013 Affect Task [1] challenged participants to develop algorithms for finding violent scenes in popular movies from DVD content based on video, audio and subtitles. The organizers provided a training set of 18 movies with frame-wise annotations of segments containing physical violence as well as several violence-related concepts (e.g. blood or fire), and a test set of 7 unannotated movies.
The system used by our group this year is largely based on the one we successfully employed in the 2012 edition of the Violent Scenes Detection task [4]. This year we tried new descriptor combinations and tweaked the neural network training parameters, based on experiments performed with the 2012 task setup.
2. METHOD
Our system builds on a set of visual and auditory features, employing the same type of classifier at different stages to obtain a violence score for each frame of an input video. The setup is largely the same as in 2012 [4].

2.1 Feature set
Visual (93 dimensions): For each video frame, we extract an 81-dimensional Histogram of Oriented Gradients (HoG), an 11-dimensional Color Naming Histogram [6] and a visual activity value. The latter is obtained by lowering the threshold of the cut detector in [3] such that it becomes overly sensitive, then counting the number of detections in a 2-second time window centered on the current frame.
Auditory (98 dimensions): In addition, we extract a set of low-level auditory features as used by [5]: Linear Predictive Coefficients (LPCs), Line Spectral Pairs (LSPs), Mel-Frequency Cepstral Coefficients (MFCCs), Zero-Crossing Rate (ZCR), and spectral centroid, flux, rolloff, and kurtosis, augmented with the variance of each feature over a half-second time window. We use frame sizes of 40 ms without overlap to make alignment with the 25-fps video frames trivial.
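As an illustration of the temporal augmentation step, the following sketch (our own, not the authors' code) shows how per-frame auditory descriptors could be augmented with their variance over a roughly half-second window; the function name, array shapes and window handling are assumptions.

```python
import numpy as np

def add_variance_context(features, frame_ms=40, context_ms=500):
    """Append the per-dimension variance over a sliding window.

    features: array of shape (n_frames, n_dims), one 40 ms audio frame per row.
    Returns an array of shape (n_frames, 2 * n_dims).
    """
    half = max(1, context_ms // frame_ms // 2)   # frames on each side of the center
    n_frames, n_dims = features.shape
    variances = np.empty_like(features)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        variances[t] = features[lo:hi].var(axis=0)
    return np.hstack([features, variances])

# With 40 ms audio frames and 25 fps video (also 40 ms per frame),
# audio frame t aligns directly with video frame t.
```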
2.2 Classifier
For classification, we use multi-layer perceptrons with a single hidden layer of 512 units and one or multiple output units. All units use the logistic sigmoid transfer function. We normalize the input data by subtracting the mean and dividing by the standard deviation of each input dimension.
Training is performed by backpropagating cross-entropy error, using random dropouts to improve generalization. We follow the dropout scheme of [2, Sec. A.1] with minor modifications: all weights are initialized to zero, mini-batches are 900 samples, the learning rate starts at 5.0, momentum is increased from 0.45 to 0.9 between epochs 10 and 20, and we train for 100 epochs only. These settings worked well in experiments with the 2012 training/testing split. In particular, we increased the learning rate from what was used in 2012, because it improved performance.
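The sketch below restates the described network and training schedule in modern PyTorch; it is not the original implementation. The dropout rates, the exact placement of dropout, the linear momentum ramp and the default weight initialization are assumptions; only the layer sizes, sigmoid units, cross-entropy loss, learning rate, batch size, momentum endpoints and epoch count come from the text.

```python
import torch
import torch.nn as nn

class ViolenceMLP(nn.Module):
    """Single hidden layer of 512 logistic sigmoid units, sigmoid outputs."""
    def __init__(self, n_inputs, n_outputs=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(0.2),            # dropout on inputs (rate assumed)
            nn.Linear(n_inputs, 512),
            nn.Sigmoid(),
            nn.Dropout(0.5),            # dropout on hidden units (rate assumed)
            nn.Linear(512, n_outputs),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def train(model, loader, epochs=100):
    """loader yields mini-batches of 900 standardized inputs and float targets."""
    loss_fn = nn.BCELoss()              # cross-entropy for sigmoid outputs
    opt = torch.optim.SGD(model.parameters(), lr=5.0, momentum=0.45)
    for epoch in range(epochs):
        # ramp momentum from 0.45 to 0.9 between epochs 10 and 20 (linear ramp assumed)
        if 10 <= epoch < 20:
            opt.param_groups[0]["momentum"] = 0.45 + 0.045 * (epoch - 10)
        elif epoch >= 20:
            opt.param_groups[0]["momentum"] = 0.9
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
```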
2.3 Fusion scheme
As last year [4], we use the concept annotations as a stepping stone for predicting violence: we train a separate classifier for each of 10 different concepts on the visual, auditory or both feature sets, then train the final violence predictor using both feature sets and all concept predictions as inputs. For comparison, we also train classifiers to predict violence just from the features or just from the concepts.
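A compact sketch of this two-stage setup follows; it uses scikit-learn's MLPClassifier only for brevity (the paper uses the network of Section 2.2), and the concept-to-feature mapping and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_fusion(features, concept_labels, violence_labels, concept_features):
    """Train one classifier per concept, then a violence classifier on
    low-level features stacked with the concept predictions.

    concept_features: dict mapping concept name -> column indices of the
    feature block (visual, auditory or both) used for that concept.
    Note: for training the final predictor, the paper uses cross-validated
    concept outputs (Sec. 3.2); this sketch uses in-sample predictions.
    """
    concept_models, concept_preds = {}, []
    for concept, cols in concept_features.items():
        clf = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
        clf.fit(features[:, cols], concept_labels[concept])
        concept_models[concept] = clf
        concept_preds.append(clf.predict_proba(features[:, cols])[:, 1])
    stacked = np.hstack([features, np.column_stack(concept_preds)])
    violence_model = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
    violence_model.fit(stacked, violence_labels)
    return concept_models, violence_model
```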
3. EXPERIMENTAL RESULTS

3.1 Concept prediction
For the training set of 18 movies, each video frame was annotated with the 10 different concepts as detailed in [1]. We divide the concepts into visual, auditory and audiovisual categories, depending on which low-level feature domains we think are relevant for each. Next, we train and evaluate a neural network for each of the concepts, employing leave-one-movie-out cross-validation. The evaluation results are very similar to our experiments in 2012 [4, Sec. 3.1], which is not surprising since the training set has only been supplemented with 20% new movies. For example, firearms and fire perform well, while carchase performs badly.
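The evaluation protocol amounts to a leave-one-movie-out loop over frame-wise data. The sketch below is illustrative only; the movie grouping, the model factory and the use of predict_proba are assumptions, not the authors' code.

```python
import numpy as np

def leave_one_movie_out(frames, labels, movie_ids, make_model):
    """Cross-validate frame-wise predictions, holding out one movie at a time.

    Returns out-of-fold predictions aligned with `frames`, which can later
    serve as realistic concept inputs for the violence classifier (Sec. 3.2).
    """
    predictions = np.zeros(len(frames))
    for movie in np.unique(movie_ids):
        test = movie_ids == movie
        train = ~test
        model = make_model()
        model.fit(frames[train], labels[train])
        predictions[test] = model.predict_proba(frames[test])[:, 1]
    return predictions
```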
3.2 Violence prediction
Next, we train a frame-wise violence predictor, using the visual and auditory low-level features as well as the concept predictions as input. Training requires inputs similar to those that will be available in the testing phase, so using the concept ground truth for training would not work. Instead, we use the concept prediction cross-validation outputs on the training set (see previous section) as a more realistic input source; in this way the system can learn which concept predictors to rely on.
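As a small illustration of this step, the snippet below assembles the violence-classifier inputs from out-of-fold concept predictions such as those produced by the leave-one-movie-out loop sketched above; variable names are placeholders and this is not the authors' code.

```python
import numpy as np

def build_violence_inputs(features, concept_cv_outputs):
    """Stack low-level features with out-of-fold concept predictions.

    concept_cv_outputs: dict {concept_name: per-frame cross-validated scores}.
    """
    concept_matrix = np.column_stack(list(concept_cv_outputs.values()))
    return np.hstack([features, concept_matrix])
```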
3.3 Evaluation results
We submitted five runs for subtask 1, i.e., the objective violence definition. Due to time constraints we were not able to prepare any runs for subtask 2, which used the subjective violence definition. One of our runs was a segment-level run (run5), which forms segments of consecutive frames that our predictor tagged as violent or non-violent. The remaining four runs are shot-level (run1 to run4) and use the shot boundaries provided by the task organizers. For each run, each partition (segment or shot) is assigned a violence score corresponding to the highest predictor output for any frame within the segment. The segments are then tagged as violent or non-violent depending on whether their violence score exceeds a certain threshold. We used the same thresholds as in our 2012 system, which were determined by cross-validation on that year's training set.
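The shot-level scoring rule reduces to max-pooling the frame scores within each shot and thresholding the result; a minimal sketch, with placeholder names and an unspecified threshold value, is given below.

```python
import numpy as np

def score_shots(frame_scores, shot_boundaries, threshold):
    """Assign each shot the maximum frame-wise violence score, then threshold.

    shot_boundaries: list of (start_frame, end_frame) pairs, end exclusive.
    Returns per-shot scores and binary violent/non-violent tags.
    """
    scores = np.array([frame_scores[start:end].max()
                       for start, end in shot_boundaries])
    return scores, scores > threshold
```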
Table 1 details the results for all our runs. The first five lines show our runs submitted to the official evaluation: the first four are shot-level runs, the fifth our single segment-level run. The next three lines are additional unofficial runs that we evaluated ourselves. The second column indicates which input features were used: 'a' for auditory, 'v' for visual, and 'c' for concept predictions. The auditory features achieved the highest MAP@100 result, with no gains being provided by the other modalities.

Table 1: Results for different feature combinations (a = auditory, v = visual, c = concept predictions)

run           feat.  prec.  rec.   max F-sc.  MAP@100
run1          a      34%    48%    40.0%      49.6%
run2          c      35%    51%    41.5%      30.4%
run3          av     34%    52%    41.4%      39.6%
run4          avc    35%    48%    40.9%      40.4%
run5          avc    23%    28%    25.8%      35.0%
(unofficial)  v      20%    50%    29.0%      23.9%
(unofficial)  ac     37%    47%    41.6%      47.4%
(unofficial)  vc     22%    53%    31.0%      28.5%

For our submissions we reused the thresholds from [4]. Unfortunately, this gave a very imbalanced precision and recall for the concept-only submission (run2), making it difficult to compare to our other runs. To better judge the relative performance of our submissions, Table 1 also reports precision, recall and F-score for the threshold maximizing the F-score. Under this metric, the combination of auditory features and concept predictions gives the best result, but differences between most runs are quite small.
Table 2 shows the movie-specific results for each of our submitted shot-level runs. Despite the poorly suited threshold, run2 performs very well on Pulp Fiction. The movie "Legally Blond" had very few violent scenes and these were hard to detect with any of our runs.

Table 2: Movie-specific results (MAP@100)

movie             run1    run2    run3    run4
Fantastic Four 1  73.1%   63.0%   60.5%   69.7%
Fargo             55.5%    0.0%   57.0%   60.6%
Forrest Gump      38.9%   19.3%   35.8%   37.0%
Legally Blond      0.0%    0.0%    4.3%    4.4%
Pulp Fiction      62.0%   90.9%   51.4%   52.1%
The God Father 1  84.7%   39.5%   49.3%   47.7%
The Pianist       32.9%    0.0%   19.3%   11.2%

4. CONCLUSIONS
Our results show that violence detection can be done fairly well using general-purpose features and generic neural network classifiers, without engineering domain-specific features. While auditory features give the best results, using mid-level concepts can give small overall gains, and more pronounced gains for particular movies.

5. REFERENCES
[1] C. Demarty, C. Penet, M. Schedl, B. Ionescu, V. Quang, and Y. Jiang. The MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[2] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
[3] B. Ionescu, V. Buzuloiu, P. Lambert, and D. Coquin. Improved Cut Detection for the Segmentation of Animation Movies. In IEEE ICASSP, France, 2006.
[4] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval, ICMR '13, pages 215-222, New York, NY, USA, 2013. ACM.
[5] C. Liu, L. Xie, and H. Meng. Classification of music and speech in Mandarin news broadcasts. In Proc. of the 9th National Conference on Man-Machine Speech Communication (NCMMSC), Huangshan, Anhui, China, 2007.
[6] J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Trans. on Image Processing, 18(7):1512-1523, 2009.