Automatic Violence Scenes Detection: A Multi-Modal Approach

Gabin Gninkoun
Computer Science Department, University of Geneva, Switzerland
gabin.gninkoun@gmail.com

Mohammad Soleymani
Computer Science Department, University of Geneva, Switzerland
mohammad.soleymani@unige.ch

ABSTRACT
In this working note, we propose a set of features and a classification scheme for automatically detecting violent scenes in movies. The features are extracted from the audio, video, and subtitle modalities of the movies. For violent scenes classification, we found the following features relevant: the short time audio energy, the motion component, and the shot words rate. We classified shots into violent and non-violent using naïve Bayesian, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) classifiers, aiming to maximize the precision of the detection within the first two minutes of retrieved content.

Keywords
Violence, audio feature extraction, visual feature extraction, text-based features, subtitles, violent scenes detection, classification

1. INTRODUCTION
Visual media is nowadays full of violent scenes. Multimedia content is therefore rated to protect minors and to warn viewers about graphic or inappropriate images. Manual rating of all existing content is not feasible given the fast growth of digital media. An automatic method that can detect violence in movies, including verbal, semantic, and visual violence, can help video-on-demand services as well as online multimedia repositories rate their content.

In this task, we used audio, visual, and text modalities to detect violent scenes in movies at the shot level. Despite its importance, this problem has not been extensively addressed in the literature. Giannakopoulos et al. combined visual and audio features in a multi-modal fusion process [3]: two individual kNN classifiers (audio-based and visual-based) were trained to distinguish violence from non-violence at the segment level. de Souza et al. developed a violent segment detector based on a visual codebook combined with a linear Support Vector Machine (LSVM); the codebook was built with k-means clustering, and the input video was segmented into shots that were converted into bags of visual words [1].

The current study's task and its dataset are provided by Technicolor for the MediaEval 2011 benchmarking initiative. Details about the task, the dataset, and the annotations are given in the task overview paper [2].

2. FEATURES AND METHODS

2.1 Proposed Content-Based Features

2.1.1 Audio-Visual features
The extracted audio features are: energy entropy, signal amplitude, short time energy, zero crossing rate, spectral flux, and spectral rolloff. A more detailed description can be found in [3]. In the visual modality, we extracted the shot length, the shot motion component, the skewness of the motion vectors, and the shot motion content. The technique used to compute the shot motion component is described in [6].

2.1.2 Text-based features
The subtitles available for all DVDs carry semantic information about the movie content. We parsed the subtitle file into a set of CaptionsElement objects, each having four attributes (num, startTime, stopTime, stemmedWords). The attribute num corresponds to the dialogue's position in the subtitle file.

Each dialogue text is first tokenized and the English stop words are removed. We then used WordNet [5] to remove names from the remaining words, and applied the Porter stemming algorithm [8] to obtain the stemmed words. Two features were derived from the text modality: the shot words rate (SWR) and the shot swearing words rate (SSWR). We defined SWR as the estimated number of words in a shot; similarly, SSWR is the estimated number of swearing words in a shot. To compute SSWR, a list of the 341 most commonly used swearing words was obtained from a swearing words dictionary (http://www.noswearing.com/dictionary) and used in our swearing words detector.

All the proposed content-based features were extracted using the shot boundaries provided by MediaEval [2]. In total, we extracted 15 features/statistics from the three modalities.
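As an illustration only, the sketch below outlines the subtitle-processing pipeline described above, assuming NLTK for tokenization, stop-word removal, WordNet lookups, and Porter stemming. The CaptionsElement structure, the swear-word list, and the shot boundaries are treated as given inputs; the WordNet-based name removal is approximated here by discarding tokens that have no WordNet synset, and a caption's words are credited to every shot its time span overlaps. None of the helper names below come from the authors' implementation.

# Illustrative sketch of the text-based features (SWR, SSWR);
# NLTK corpora (punkt, stopwords, wordnet) are assumed to be installed.
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple

from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

@dataclass
class CaptionsElement:
    num: int                   # dialogue position in the subtitle file
    startTime: float           # caption start, in seconds
    stopTime: float            # caption end, in seconds
    stemmedWords: List[str]    # preprocessed dialogue words

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(dialogue: str) -> List[str]:
    """Tokenize, drop stop words, drop likely names, and stem."""
    tokens = [t.lower() for t in word_tokenize(dialogue) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Approximation of the WordNet name removal: keep only tokens that
    # WordNet knows about, assuming proper names have no synsets.
    tokens = [t for t in tokens if wordnet.synsets(t)]
    return [STEMMER.stem(t) for t in tokens]

def shot_word_rates(shot: Tuple[float, float],
                    captions: Sequence[CaptionsElement],
                    swear_words: Set[str]) -> Tuple[int, int]:
    """Estimate SWR and SSWR for one shot given its (start, end) times.
    swear_words is assumed to be stemmed with the same stemmer."""
    start, end = shot
    swr, sswr = 0, 0
    for cap in captions:
        if cap.stopTime >= start and cap.startTime <= end:  # time overlap
            swr += len(cap.stemmedWords)
            sswr += sum(w in swear_words for w in cap.stemmedWords)
    return swr, sswr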
2.2 Discriminant Analysis and Post-Processing
Three different classifiers were applied to detect violent shots, namely QDA, LDA, and a naïve Bayesian classifier. A post-processing step was also applied to the QDA results to take into account the temporal correlation between consecutive shots. The post-processing consists of smoothing the confidence scores of the violent class from QDA using weights derived from the transition probabilities on the training set.

3. EXPERIMENTS AND RESULTS
According to the requirements, we generated five different runs, trying different classifiers and prior probabilities. Their characteristics are listed in Table 1.

Table 1: The classifiers and prior probabilities for the five submitted runs. pn is the prior probability of the non-violent class and pp of the violent class.

Run  Classifier              pn    pp
1    LDA                     0.5   0.5
2    LDA                     0.3   0.7
3    Naïve Bayesian          0.5   0.5
4    QDA                     0.5   0.5
5    QDA + post-processing   -     -

3.1 Evaluation criteria and Classifier selection
The goal of violence detection in the proposed use case scenario is to provide the user with the most violent shots in the movie. We defined an evaluation criterion based on this scenario as follows: the detected violent shots were first ranked by their confidence scores, and the shots making up the first two minutes at the top of the list were set aside as the retrieved content. Precision, recall, and the F1 score were then computed over these top-ranked two minutes of shots. We used K-fold cross-validation with K = 11 and different prior probabilities for each class. The best performance was achieved with the LDA and QDA methods using equal prior probabilities for both classes.

3.2 Post-processing
The results of the last run correspond to post-processing of the fourth run. A weighted average of the confidence scores was used to smooth the violent shot decisions. The weights are given in Table 2: the first row gives the probability of a transition to a violent shot when the four neighbouring shots are non-violent, and the second row gives the probability of a transition to a violent shot when the neighbouring shots are violent. These values were obtained from the training set. The post-processing reduced the number of false positives significantly.

Table 2: The transition probabilities were computed on a window of five consecutive shots.

              1     2     3     4     5
non-violent   0.04  0.03  0     0.03  0.04
violent       0.67  0.77  1     0.77  0.67
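The paper does not spell out the exact weighting scheme, so the sketch below shows only one plausible reading of the smoothing step: each shot's violent-class confidence is replaced by a weighted average over a centred five-shot window, with the weights taken from the violent row of Table 2 and renormalised at the sequence boundaries. Both the choice of row and the normalisation are assumptions made for illustration.

# Illustrative sketch of the confidence-score smoothing; the weighting
# scheme is an assumption, not necessarily the authors' exact procedure.
from typing import List, Sequence

# Weights taken from the violent row of Table 2, one per position in the
# centred five-shot window.
WINDOW_WEIGHTS = (0.67, 0.77, 1.0, 0.77, 0.67)

def smooth_confidences(conf: Sequence[float],
                       weights: Sequence[float] = WINDOW_WEIGHTS) -> List[float]:
    """Replace each shot's violent-class confidence with a weighted average
    over a five-shot window centred on it (truncated at the sequence ends)."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(conf)):
        num, den = 0.0, 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(conf):
                num += w * conf[j]
                den += w
        smoothed.append(num / den if den > 0 else conf[i])
    return smoothed

A higher smoothed score then indicates a shot surrounded by other confidently violent shots, which is what allows the post-processing to suppress isolated false alarms.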
We ultimately obtained the best result, with the minimum MediaEval cost C ≈ 2.02 and recall r = 0.87 (Table 3), using LDA with prior probabilities of 0.3 and 0.7 for the non-violent and violent classes, respectively. However, considering both the F1 score and the MediaEval cost, the fourth run, which used QDA with equal prior probabilities, performed better. These results matched our expectations from the cross-validation results on the training set.

Table 3: Violence detection system evaluation results for the 5 submitted runs at shot level.

Run  Precision  Recall  F-measure  MediaEval cost
1    0.174      0.377   0.238      6.522
2    0.164      0.870   0.276      2.024
3    0.183      0.426   0.256      6.049
4    0.178      0.774   0.289      2.838
5    0.252      0.077   0.119      9.252

4. CONCLUSIONS
We have proposed a set of features to automatically detect violent material at the shot level in commercial movies. The performance of the proposed system has been evaluated with a detection cost function that weights the false alarm and missed detection rates. The short time energy, the motion component, and the shot words rate were proposed and used as relevant features for classifying a movie's shots as violent or non-violent. The proposed methods were unable to detect all the violent scenes without sacrificing the false positive rate, because the proposed features are not sufficient to capture all violent actions or events. Automatic detection of higher-level concepts such as screams, explosions, or blood is needed to improve the detections.

5. ACKNOWLEDGEMENTS
This work is supported by the European Community's Seventh Framework Programme [FP7/2007-2011] under grant agreement PetaMedia No. 216444.

6. REFERENCES
[1] F. D. M. de Souza, G. C. Chavez, E. A. do Valle Jr., and A. de A. Araujo. Violence detection in video using spatio-temporal features. In SIBGRAPI Conference on Graphics, Patterns and Images, pages 224–230, 2010.
[2] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In Working Notes Proceedings of the MediaEval 2011 Workshop, Pisa, Italy, September 2011.
[3] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, and S. Theodoridis. Audio-visual fusion for detecting violent scenes in videos. In S. Konstantopoulos et al., editors, Artificial Intelligence: Theories, Models and Applications, volume 6040 of Lecture Notes in Computer Science, pages 91–100. Springer Berlin / Heidelberg, 2010.
[4] Y. Gong, W. Wang, S. Jiang, Q. Huang, and W. Gao. Detecting violent scenes in movies by auditory and visual cues. In Y.-M. Huang et al., editors, Advances in Multimedia Information Processing - PCM 2008, volume 5353 of Lecture Notes in Computer Science, pages 317–326. Springer Berlin / Heidelberg, 2008.
[5] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[6] Z. Rasheed and S. Mubarak. Video categorization using semantics and semiotics. PhD thesis, Orlando, FL 32816, USA, 2003. AAI3110078.
[7] Z. Rasheed, Y. Sheikh, and M. Shah. On the use of computable features for film classification. IEEE Transactions on Circuits and Systems for Video Technology, 15(1):52–64, 2005.
[8] P. Willett. The Porter stemming algorithm: then and now. Program: Electronic Library and Information Systems, 40(3):219–223, 2006.