<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FAR at MediaEval 2013 Violent Scenes Detection: Concept-based Violent Scenes Detection in Movies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mats Sjöberg</string-name>
          <email>mats.sjoberg@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Schlüter</string-name>
          <email>jan.schlueter@ofai.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Austrian Research Institute for Artificial Intelligence</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Johannes Kepler University</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <addr-line>Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>The MediaEval 2013 Affect Task challenged participants to automatically find violent scenes in a set of popular movies. We propose to first predict a set of mid-level concepts from low-level visual and auditory features, then fuse the concept predictions and features to detect violent content. We deliberately restrict ourselves to simple general-purpose features with limited temporal context and a generic neural network classifier. The system used this year is largely based on the one successfully employed by our group in the 2012 task, with some improvements based on our experience from last year. Our best-performing run with regard to the official metric received a MAP@100 of 49.6%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2013 Affect Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] challenged
participants to develop algorithms for finding violent scenes in
popular movies from DVD content based on video, audio
and subtitles. The organizers provided a training set of 18
movies with frame-wise annotations of segments containing
physical violence as well as several violence-related concepts
(e.g. blood or fire), and a test set of 7 unannotated movies.
      </p>
      <p>
        The system used by our group this year is largely based on
the one successfully employed by us in the 2012 edition of the
violent scenes detection task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This year we have tried
new descriptor combinations, and tweaked the neural
network training parameters based on experiments performed
with the 2012 task setup.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
      <p>
        Our system builds on a set of visual and auditory features,
employing the same type of classifier at different stages to
obtain a violence score for each frame of an input video. The
setup is largely the same as in 2012 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature set</title>
      <p>
        Visual (93 dimensions): For each video frame, we extract
an 81-dimensional Histogram of Oriented Gradients (HoG),
an 11-dimensional Color Naming Histogram [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and a
visual activity value. The latter is obtained by lowering the
threshold of the cut detector in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] such that it becomes
overly sensitive, then counting the number of detections in
a 2-second time window centered on the current frame.
      </p>
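      <p>For illustration, the following Python sketch computes such an activity value, assuming a plain frame-difference detector as a stand-in for the histogram-based cut detector of [3]; the threshold and function names are placeholders.</p>
      <preformat preformat-type="code">
# Sketch of the visual activity feature: an over-sensitive cut detector is
# run over the frame sequence and its detections are counted in a 2-second
# window centered on each frame. The frame-difference "detector" below only
# stands in for the detector of [3]; the threshold is illustrative.
import numpy as np

FPS = 25          # frame rate assumed for the videos
WINDOW = 2 * FPS  # 2-second window, in frames

def visual_activity(frames, threshold=10.0):
    """frames: array of shape (n_frames, height, width), grayscale values."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    # overly sensitive detector: flag every frame whose mean absolute
    # difference to its predecessor exceeds a low threshold
    detections = np.concatenate([[0.0], np.greater(diffs, threshold)])
    half = WINDOW // 2
    padded = np.pad(detections, (half, half))
    # count detections in the centered 2-second window around each frame
    return np.array([padded[i:i + WINDOW].sum() for i in range(len(detections))])
      </preformat>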
      <p>
        Auditory (98 dimensions): In addition, we extract a set
of low-level auditory features as used by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: Linear
Predictive Coefficients (LPCs), Line Spectral Pairs (LSPs),
Mel-Frequency Cepstral Coefficients (MFCCs), Zero-Crossing
Rate (ZCR), and spectral centroid, flux, rolloff, and kurtosis,
augmented with the variance of each feature over a
half-second time window. We use frame sizes of 40 ms without
overlap to make alignment with the 25-fps video frames
trivial.
      </p>
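      <p>A minimal sketch of such an extraction, using librosa as a stand-in toolchain and covering only a subset of the descriptors listed above, could look as follows.</p>
      <preformat preformat-type="code">
# Illustrative extraction of some of the listed auditory descriptors on
# 40 ms non-overlapping frames, so that one audio frame maps to one 25-fps
# video frame; librosa is an assumption, not necessarily the tool used here.
import numpy as np
import librosa

def auditory_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr, mono=True)
    frame = int(0.040 * sr)  # 40 ms frames, no overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame, hop_length=frame)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame,
                                             hop_length=frame)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr,
                                                 n_fft=frame, hop_length=frame)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr,
                                               n_fft=frame, hop_length=frame)
    feats = np.vstack([mfcc, zcr, centroid, rolloff]).T  # (n_frames, n_dims)
    # augment with the variance of each descriptor over a half-second window
    # (12 frames of 40 ms), here a simple trailing window
    half_sec = 12
    var = np.stack([feats[max(i - half_sec, 0):i + 1].var(axis=0)
                    for i in range(len(feats))])
    return np.hstack([feats, var])
      </preformat>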
    </sec>
    <sec id="sec-4">
      <title>2.2 Classifier</title>
      <p>For classification, we use multi-layer perceptrons with a
single hidden layer of 512 units and one or multiple output
units. All units use the logistic sigmoid transfer function.</p>
      <p>We normalize the input data by subtracting the mean and
dividing by the standard deviation of each input dimension.</p>
      <p>Training is performed by backpropagating cross-entropy
error, using random dropouts to improve generalization. We
follow the dropout scheme of [2, Sec. A.1] with minor
modifications: all weights are initialized to zero, mini-batches are
900 samples, the learning rate starts at 5.0, momentum is
increased from 0.45 to 0.9 between epochs 10 and 20, and we
train for 100 epochs only. These settings worked well in
experiments with the 2012 training/testing split. In particular,
we increased the learning rate from what was used in 2012,
because it improved performance.</p>
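      <p>For illustration, a minimal PyTorch sketch of this network and training schedule follows; the dropout rates and the absence of learning-rate decay are assumptions, and inputs are expected to be standardized as described above.</p>
      <preformat preformat-type="code">
# Sketch of the classifier and training schedule described above; the actual
# implementation may differ (e.g. dropout rates, learning-rate decay).
import torch
import torch.nn as nn

def make_net(n_inputs, n_outputs=1, dropout=0.5):
    net = nn.Sequential(
        nn.Dropout(dropout),            # dropout on the inputs (rate assumed)
        nn.Linear(n_inputs, 512),
        nn.Sigmoid(),                   # 512 logistic hidden units
        nn.Dropout(dropout),            # dropout on the hidden layer
        nn.Linear(512, n_outputs),
        nn.Sigmoid(),                   # logistic output unit(s)
    )
    for m in net.modules():
        if isinstance(m, nn.Linear):    # all weights initialized to zero,
            nn.init.zeros_(m.weight)    # as stated above
            nn.init.zeros_(m.bias)
    return net

def train(net, loader, epochs=100):
    """loader yields mini-batches of 900 standardized inputs and float targets."""
    loss_fn = nn.BCELoss()              # cross-entropy for sigmoid outputs
    opt = torch.optim.SGD(net.parameters(), lr=5.0, momentum=0.45)
    for epoch in range(epochs):
        # ramp momentum linearly from 0.45 to 0.9 between epochs 10 and 20
        ramp = min(max((epoch - 10) / 10.0, 0.0), 1.0)
        for group in opt.param_groups:
            group["momentum"] = 0.45 + ramp * (0.9 - 0.45)
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(net(x), y)
            loss.backward()
            opt.step()
    return net
      </preformat>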
    </sec>
    <sec id="sec-5">
      <title>2.3 Fusion scheme</title>
      <p>
        As last year [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we use the concept annotations as a
stepping stone for predicting violence: We train a separate
classifier for each of 10 different concepts on the visual, auditory,
or both feature sets, then train the final violence predictor
using both feature sets and all concept predictions as inputs.
For comparison, we also train classifiers to predict violence
just from the features or just from the concepts.
      </p>
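      <p>A simplified sketch of the two-stage scheme is given below, using scikit-learn's MLPClassifier as a stand-in for the network of Section 2.2; for brevity it feeds resubstitution concept scores to the final predictor, whereas the actual system uses the cross-validated outputs described in Section 3.2.</p>
      <preformat preformat-type="code">
# Two-stage fusion sketch: one classifier per concept, then a final violence
# classifier fed with the low-level features plus the concept scores.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_fusion(X, concept_labels, violence_labels):
    """X: (n_frames, n_features); concept_labels: dict of name to 0/1 labels."""
    concept_models, concept_scores = {}, []
    for name, y in concept_labels.items():
        clf = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
        clf.fit(X, y)
        concept_models[name] = clf
        concept_scores.append(clf.predict_proba(X)[:, 1])
    # final predictor: low-level features concatenated with concept scores;
    # the real system uses cross-validated concept outputs here (Sec. 3.2)
    X_fused = np.hstack([X, np.stack(concept_scores, axis=1)])
    violence = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
    violence.fit(X_fused, violence_labels)
    return concept_models, violence
      </preformat>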
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Concept prediction</title>
      <p>
        For the training set of 18 movies, each video frame was
annotated with the 10 different concepts as detailed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
We divide the concepts into visual, auditory and audiovisual
categories, depending on which low-level feature domains we
think are relevant for each. Next, we train and evaluate a
neural network for each of the concepts, employing
leave-one-movie-out cross-validation. The evaluation results are
very similar to our experiments in 2012 [4, Sec. 3.1], which
is not surprising since the training set has only been
supplemented with 20% new movies. For example, firearms and
fire perform well, while carchase performs badly.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Violence prediction</title>
      <p>Next, we train a frame-wise violence predictor, using
visual and auditory low-level features, as well as the concept
predictions, as input. Training requires inputs that are
similar to those that will be used in the testing phase, so using
the concept ground truth for training will not work. Instead,
we use the concept prediction cross-validation outputs on
the training set (see previous section) as a more realistic
input source; in this way the system can learn which concept
predictors to rely on.</p>
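      <p>For illustration, such out-of-fold concept scores can be obtained with leave-one-movie-out cross-validation, e.g. using scikit-learn; this is a sketch, not the exact implementation used here.</p>
      <preformat preformat-type="code">
# Each frame's concept score comes from a model that never saw that frame's
# movie, mirroring the situation at test time.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.neural_network import MLPClassifier

def oof_concept_scores(X, concept_labels, movie_ids):
    """X: (n_frames, n_features); movie_ids: movie index of each frame."""
    scores = []
    for name, y in concept_labels.items():
        clf = MLPClassifier(hidden_layer_sizes=(512,), activation="logistic")
        proba = cross_val_predict(clf, X, y, groups=movie_ids,
                                  cv=LeaveOneGroupOut(), method="predict_proba")
        scores.append(proba[:, 1])
    # these out-of-fold scores replace the ground-truth concepts as inputs
    # when training the final violence classifier
    return np.stack(scores, axis=1)
      </preformat>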
    </sec>
    <sec id="sec-9">
      <title>3.3 Evaluation results</title>
      <p>We submitted five runs for subtask 1, i.e., the objective
violence definition. Due to time constraints we were not able
to prepare any runs for subtask 2, which used the subjective
violence definition. One of our runs was a segment-level run
(run5), which forms segments of consecutive frames that our
predictor tagged as violent or non-violent. The remaining
four runs are shot-level (from run1 to run4), which use the
shot boundaries provided by the task organizers. For each
run, each partition (segment or shot) is assigned a violence
score corresponding to the highest predictor output for any
frame within the segment. The segments are then tagged as
violent or non-violent depending on whether their violence
score exceeds a certain threshold. We used the same
thresholds as our system in 2012, which were determined
by cross-validation on the training set of that year.</p>
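      <p>A minimal sketch of this shot-level scoring follows; the shot boundaries are (start, end) frame indices and the threshold value is only a placeholder.</p>
      <preformat preformat-type="code">
# Shot-level scoring: each shot receives the highest frame-level violence
# score within its boundaries and is tagged violent if that score exceeds a
# fixed threshold (reused from the 2012 system; 0.5 is a placeholder).
import numpy as np

def score_shots(frame_scores, shot_boundaries, threshold=0.5):
    shot_scores = np.array([frame_scores[start:end].max()
                            for start, end in shot_boundaries])
    violent = np.greater(shot_scores, threshold)
    return shot_scores, violent
      </preformat>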
      <p>Table 1 details the results for all our runs. The first five
lines show our runs submitted to the official evaluation. The
first four are shot-level runs, the fifth our single
segment-level run. The next three lines are additional unofficial runs
that we evaluated ourselves. The second column indicates
which input features were used, 'a' for auditory, 'v' for
visual, and 'c' for concept predictions. The auditory features
achieved the highest MAP@100 result, with no gains being
provided by the other modalities.</p>
      <p>
        For our submissions we reused the thresholds from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Unfortunately, this gave a very imbalanced precision and recall
for the concept-only submission (run 2), making it difficult
to compare to our other runs. To better judge the relative
performance of our submissions, Table 1 reports precision,
recall and F-score for the threshold maximizing the F-score.
Under this metric, the combination of auditory features and
concept predictions gives the best result, but differences
between most runs are quite small.
      </p>
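      <p>For illustration, such an F-score-maximizing threshold can be obtained from the precision-recall curve, e.g. with scikit-learn; the sketch below is not necessarily the procedure used for Table 1.</p>
      <preformat preformat-type="code">
# Choose the decision threshold that maximizes the F-score on a set of
# frame- or shot-level violence scores.
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f_threshold(y_true, scores):
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the last point
    f = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    best = int(np.argmax(f))
    return thresholds[best], f[best]
      </preformat>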
      <p>Table 2 shows the movie-specific results for each of our
submitted shot-level runs. Despite the bad threshold on
run2, it performs very well on Pulp Fiction. The movie
"Legally Blond" had very few violent scenes, and these were
hard to detect with any of our runs.</p>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS</title>
      <p>Our results show that violence detection can be done fairly
well using general-purpose features and generic neural
network classifiers, without engineering domain-specific features.
While auditory features give the best results, using mid-level
concepts can give small overall gains, and more pronounced
gains for particular movies.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Penet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Quang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>The MediaEval 2013 Affect Task: Violent Scenes Detection</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Improving neural networks by preventing co-adaptation of feature detectors</article-title>
          . arXiv,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Buzuloiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lambert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Coquin</surname>
          </string-name>
          .
          <article-title>Improved Cut Detection for the Segmentation of Animation Movies</article-title>
          . In IEEE ICASSP, France,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , J. Schluter, I. Mironica, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          .
          <article-title>A naive mid-level concept-based fusion approach to violence detection in Hollywood movies</article-title>
          .
          <source>In Proceedings of the 3rd ACM conference on International conference on multimedia retrieval</source>
          ,
          <source>ICMR '13</source>
          , pages
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          .
          <article-title>Classification of music and speech in Mandarin news broadcasts</article-title>
          .
          <source>In Proc. of the 9th Nat. Conf. on Man-Machine Speech Communication (NCMMSC)</source>
          , Huangshan, Anhui, China,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>van de Weijer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Larlus</surname>
          </string-name>
          .
          <article-title>Learning color names for real-world applications</article-title>
          .
          <source>IEEE Trans. on Image Processing</source>
          ,
          <volume>18</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1512</fpage>
          -
          <lpage>1523</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>