Automatic Violence Scenes Detection: A Multi-Modal Approach

Gabin Gninkoun
Computer Science Department, University of Geneva, Switzerland
gabin.gninkoun@gmail.com

Mohammad Soleymani
Computer Science Department, University of Geneva, Switzerland
mohammad.soleymani@unige.ch

ABSTRACT
In this working note, we propose a set of features and a classification scheme for automatically detecting violent scenes in movies. The features are extracted from the audio, video, and subtitle modalities of the movies. For violent scenes classification, we found the following features relevant: the short time audio energy, the motion component, and the shot words rate. We classified shots into violent and non-violent using naïve Bayesian, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) classifiers, aiming to maximize the precision of the detection within the first two minutes of retrieved content.

Keywords
Violence, audio feature extraction, visual feature extraction, text-based features, subtitles, violent scenes detection, classification

1. INTRODUCTION
Visual media is nowadays full of violent scenes. Multimedia content is therefore rated to protect minors and to warn viewers about graphic or inappropriate images. Manual rating of all existing content is not feasible given the fast growth of digital media. An automatic method that can detect violence in movies, including verbal, semantic, and visual violence, can help video-on-demand services as well as online multimedia repositories rate their content.

In this task, we used audio, visual, and text modalities to detect violent scenes in movies at the shot level. Despite its importance, this problem has not been extensively addressed in the literature. Giannakopoulos et al. combined visual and audio features in a multi-modal fusion process [3]: two individual kNN classifiers (audio-based and visual-based) were trained to distinguish violence from non-violence at the segment level. de Souza et al. developed a violent segment detector based on a visual codebook combined with a linear Support Vector Machine (LSVM); the codebook was built with k-means clustering, and the input video was segmented into shots that were converted into bags of visual words [1].

The current study's task and its dataset are provided by Technicolor for the MediaEval 2011 benchmarking initiative. Details about the task, the dataset, and the annotations are given in the task overview paper [2].

2. FEATURES AND METHODS

2.1 Proposed Content-Based Features

2.1.1 Audio-Visual features
The extracted audio features are: energy entropy, signal amplitude, short time energy, zero crossing rate, spectral flux, and spectral rolloff. A more detailed description can be found in [3]. In the visual modality, we extracted the shot length, the shot motion component, the skewness of the motion vectors, and the shot motion content. The technique used to compute the shot motion component is described in [6].

2.1.2 Text-based features
The subtitles available for all DVDs carry semantic information about the movie content. We parsed the subtitle file into a set of CaptionsElement objects, each having four attributes (num, startTime, stopTime, stemmedWords). The attribute num corresponds to the dialogue's position in the subtitle file.

Each dialogue text is first tokenized and the English stop words are removed. We then used WordNet [5] to remove names from the remaining words, and applied the Porter stemming algorithm [8] to obtain the stemmed words. Two features were derived from the text modality: the shot words rate (SWR) and the shot swearing words rate (SSWR). We defined SWR as the estimated number of words in a shot; similarly, SSWR is the estimated number of swearing words in a shot. To compute SSWR, a list of the 341 most commonly used swearing words was obtained from a swearing words dictionary (http://www.noswearing.com/dictionary) and used in our swearing words detector.

All the proposed content-based features were extracted using the shot boundaries provided by MediaEval [2]. In total, we extracted 15 features/statistics from the three modalities.
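As an illustration only, the sketch below outlines the subtitle-processing pipeline described above, assuming NLTK for tokenization, stop-word removal, WordNet lookups, and Porter stemming. The CaptionsElement structure, the swear-word list, and the shot boundaries are treated as given inputs; the WordNet-based name removal is approximated here by discarding tokens that have no WordNet synset, and a caption's words are credited to every shot its time span overlaps. None of the helper names below come from the authors' implementation.

# Illustrative sketch of the text-based features (SWR, SSWR);
# NLTK corpora (punkt, stopwords, wordnet) are assumed to be installed.
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple

from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

@dataclass
class CaptionsElement:
    num: int                   # dialogue position in the subtitle file
    startTime: float           # caption start, in seconds
    stopTime: float            # caption end, in seconds
    stemmedWords: List[str]    # preprocessed dialogue words

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(dialogue: str) -> List[str]:
    """Tokenize, drop stop words, drop likely names, and stem."""
    tokens = [t.lower() for t in word_tokenize(dialogue) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Approximation of the WordNet name removal: keep only tokens that
    # WordNet knows about, assuming proper names have no synsets.
    tokens = [t for t in tokens if wordnet.synsets(t)]
    return [STEMMER.stem(t) for t in tokens]

def shot_word_rates(shot: Tuple[float, float],
                    captions: Sequence[CaptionsElement],
                    swear_words: Set[str]) -> Tuple[int, int]:
    """Estimate SWR and SSWR for one shot given its (start, end) times.
    swear_words is assumed to be stemmed with the same stemmer."""
    start, end = shot
    swr, sswr = 0, 0
    for cap in captions:
        if cap.stopTime >= start and cap.startTime <= end:  # time overlap
            swr += len(cap.stemmedWords)
            sswr += sum(w in swear_words for w in cap.stemmedWords)
    return swr, sswr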
2.2 Discriminant Analysis and Post-Processing
Three different classifiers were applied to detect violent shots, namely QDA, LDA, and a naïve Bayesian classifier. A post-processing step was also applied to the QDA results to take into account the temporal correlation between consecutive shots. The post-processing consists of smoothing the confidence scores of the violent class from QDA using weights derived from the transition probabilities on the training set.

3. EXPERIMENTS AND RESULTS
According to the requirements, we generated five different runs, trying different classifiers and prior probabilities. Their characteristics are listed in Table 1.

Table 1: The classifiers and prior probabilities for the five submitted runs. pn is the prior probability of the non-violent class and pp of the violent class.

Run  Classifier              pn    pp
1    LDA                     0.5   0.5
2    LDA                     0.3   0.7
3    Naïve Bayesian          0.5   0.5
4    QDA                     0.5   0.5
5    QDA + post-processing   -     -

3.1 Evaluation criteria and Classifier selection
The goal of violence detection in the proposed use case scenario is to provide the user with the most violent shots in the movie. We defined an evaluation criterion based on this scenario as follows: the detected violent shots were first ranked by their confidence scores, and the shots making up the first two minutes at the top of the list were set aside as the retrieved content. Precision, recall, and the F1 score were then computed over these top-ranked two minutes of shots. We used K-fold cross-validation with K = 11 and different prior probabilities for each class. The best performance was achieved with the LDA and QDA methods using equal prior probabilities for both classes.

3.2 Post-processing
The results of the last run correspond to post-processing of the fourth run. A weighted average of the confidence scores was used to smooth the violent shot decisions. The weights are given in Table 2: the first row gives the probability of a transition to a violent shot when the four neighbouring shots are non-violent, and the second row gives the probability of a transition to a violent shot when the neighbouring shots are violent. These values were obtained from the training set. The post-processing reduced the number of false positives significantly.

Table 2: The transition probabilities were computed on a window of five consecutive shots.

              1     2     3     4     5
non-violent   0.04  0.03  0     0.03  0.04
violent       0.67  0.77  1     0.77  0.67
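The paper does not spell out the exact weighting scheme, so the sketch below shows only one plausible reading of the smoothing step: each shot's violent-class confidence is replaced by a weighted average over a centred five-shot window, with the weights taken from the violent row of Table 2 and renormalised at the sequence boundaries. Both the choice of row and the normalisation are assumptions made for illustration.

# Illustrative sketch of the confidence-score smoothing; the weighting
# scheme is an assumption, not necessarily the authors' exact procedure.
from typing import List, Sequence

# Weights taken from the violent row of Table 2, one per position in the
# centred five-shot window.
WINDOW_WEIGHTS = (0.67, 0.77, 1.0, 0.77, 0.67)

def smooth_confidences(conf: Sequence[float],
                       weights: Sequence[float] = WINDOW_WEIGHTS) -> List[float]:
    """Replace each shot's violent-class confidence with a weighted average
    over a five-shot window centred on it (truncated at the sequence ends)."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(conf)):
        num, den = 0.0, 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(conf):
                num += w * conf[j]
                den += w
        smoothed.append(num / den if den > 0 else conf[i])
    return smoothed

A higher smoothed score then indicates a shot surrounded by other confidently violent shots, which is what allows the post-processing to suppress isolated false alarms.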
We ultimately obtained the best result, with the minimum MediaEval cost C ≈ 2.02 and recall r = 0.87 (Table 3), using LDA with prior probabilities of 0.3 and 0.7 for the non-violent and violent classes, respectively. However, considering both the F1 score and the MediaEval cost, the fourth run, which used QDA with equal prior probabilities, performed better. These results matched our expectations from the cross-validation results on the training set.

Table 3: Violence detection system evaluation results for the 5 submitted runs at shot level.

Run  Precision  Recall  F-measure  MediaEval cost
1    0.174      0.377   0.238      6.522
2    0.164      0.870   0.276      2.024
3    0.183      0.426   0.256      6.049
4    0.178      0.774   0.289      2.838
5    0.252      0.077   0.119      9.252

4. CONCLUSIONS
We have proposed a set of features to automatically detect violent material at the shot level in commercial movies. The performance of the proposed system has been evaluated with a detection cost function that weights the false alarm and missed detection rates. The short time energy, the motion component, and the shot words rate were proposed and used as relevant features for classifying a movie's shots as violent or non-violent. The proposed methods were unable to detect all the violent scenes without sacrificing the false positive rate, because the proposed features are not sufficient to capture all violent actions or events. Automatic detection of higher-level concepts such as screams, explosions, or blood is needed to improve the detections.

5. ACKNOWLEDGEMENTS
This work is supported by the European Community's Seventh Framework Programme [FP7/2007-2011] under grant agreement PetaMedia No. 216444.

6. REFERENCES
[1] F. D. M. de Souza, G. C. Chavez, E. A. do Valle Jr., and A. de A. Araujo. Violence detection in video using spatio-temporal features. In SIBGRAPI Conference on Graphics, Patterns and Images, pages 224–230, 2010.
[2] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In Working Notes Proceedings of the MediaEval 2011 Workshop, Pisa, Italy, September 2011.
[3] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, and S. Theodoridis. Audio-visual fusion for detecting violent scenes in videos. In S. Konstantopoulos et al., editors, Artificial Intelligence: Theories, Models and Applications, volume 6040 of Lecture Notes in Computer Science, pages 91–100. Springer Berlin / Heidelberg, 2010.
[4] Y. Gong, W. Wang, S. Jiang, Q. Huang, and W. Gao. Detecting violent scenes in movies by auditory and visual cues. In Y.-M. Huang et al., editors, Advances in Multimedia Information Processing - PCM 2008, volume 5353 of Lecture Notes in Computer Science, pages 317–326. Springer Berlin / Heidelberg, 2008.
[5] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[6] Z. Rasheed and S. Mubarak. Video categorization using semantics and semiotics. PhD thesis, Orlando, FL 32816, USA, 2003. AAI3110078.
[7] Z. Rasheed, Y. Sheikh, and M. Shah. On the use of computable features for film classification. IEEE Transactions on Circuits and Systems for Video Technology, 15(1):52–64, 2005.
[8] P. Willett. The Porter stemming algorithm: then and now. Program: Electronic Library and Information Systems, 40(3):219–223, 2006.