VISILAB at MediaEval 2013: Fight Detection

Ismael Serrano, Oscar Déniz, Gloria Bueno
VISILAB group, University of Castilla-La Mancha
E.T.S.I. Industriales, Avda. Camilo José Cela 3, 13071 Ciudad Real, Spain.
Ismael.Serrano@uclm.es, Oscar.Deniz@uclm.es, Gloria.Bueno@uclm.es

ABSTRACT
Fight detection from video is a task with direct application in surveillance scenarios such as prison cells, yards, mental institutions, etc. Work on this task is growing, although only a few datasets for fight detection currently exist. The Violent Scene Detection task of the MediaEval initiative offers a practical challenge for detecting violent video clips in movies. In this working notes paper we briefly describe our method for fight detection, which has been used to detect fights within the above-mentioned Violent Scene Detection task. Inspired by results suggesting that kinematic features alone are discriminative for at least some actions, our method uses extreme acceleration patterns as its main feature. These extreme accelerations are efficiently estimated by applying the Radon transform to the power spectrum of consecutive frames.

Keywords
Action recognition, violence detection, fight detection

1. INTRODUCTION
In recent years, the problem of human action recognition from video has become tractable using computer vision techniques [5]. Despite its potential usefulness, the specific task of violent scene detection has been comparatively less studied. The annual MediaEval evaluation campaign introduced this specific problem in 2011. For an overview of this year's task, please see [4].

[Figure 1: Two consecutive frames in a fight clip from a movie. Note the blur on the left side of the second frame.]

2. PROPOSED METHOD
The presence of large accelerations is key in the task of fight detection ([6], [2] and [1]). In this context, body part tracking can be considered, as in [3], which introduced the so-called Acceleration Measure Vectors (AMV). In general, acceleration can be inferred from tracked point trajectories. However, extreme acceleration implies image blur (see for example Figure 1), which makes tracking less precise or even impossible.

Motion blur entails a shift in image content towards low frequencies. This behaviour allows building an efficient acceleration estimator for video. Our proposed method works with sequences of 50 frames; therefore we divide the shots into 50-frame clips. First, we compute the power spectrum of two consecutive frames. It can be shown that, when there is a sudden motion between two consecutive frames, the power spectrum image will depict an ellipse. The orientation of the ellipse is perpendicular to the motion direction, the frequencies outside the ellipse being attenuated. Most importantly, the eccentricity of this ellipse depends on the acceleration. Basically, the proposed method aims at detecting the sudden appearance of such an ellipse. Our objective is then to detect this ellipse and estimate its eccentricity, which represents the magnitude of the acceleration. Ellipse detection can be reliably performed using the Radon transform, which provides image projections along lines with different orientations.

For each pair of consecutive frames, we compute the power spectrum using the 2D Fast Fourier Transform (in order to avoid edge effects, a Hanning window is applied before computing the FFT). Let us call these spectra images P_{i-1} and P_i. These images are divided, i.e. C = P_i / P_{i-1}. When there is no change between the two frames, the power spectra will be equal and C will have a constant value. When motion has occurred, an ellipse will appear in C. A sketch of this step is given below.
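The following is a minimal sketch of this spectrum-ratio step, not the authors' released code. The function name, the requirement of grayscale float frames, and the epsilon guard against division by zero (which the paper does not specify) are assumptions.

```python
import numpy as np

def spectrum_ratio(frame_prev, frame_curr, eps=1e-8):
    """Return C = P_i / P_{i-1}, the ratio of power spectra of two frames."""
    # Hanning window to reduce edge effects before the FFT, as described above.
    h, w = frame_prev.shape
    window = np.outer(np.hanning(h), np.hanning(w))
    # Power spectrum = squared magnitude of the centered 2D FFT.
    p_prev = np.abs(np.fft.fftshift(np.fft.fft2(frame_prev * window))) ** 2
    p_curr = np.abs(np.fft.fftshift(np.fft.fft2(frame_curr * window))) ** 2
    # Pixel-wise division; eps is an assumed guard against zero-valued bins.
    return p_curr / (p_prev + eps)
```

A design consequence of the ratio is that static image content, which contributes equally to both spectra, cancels out, so C mainly reflects the change between the two frames.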
After applying the Radon transform to image C, its vertical maximum projection vector is obtained and normalized (to maximum value 1). When there is an ellipse, this vector shows a sharp peak, representing the major axis of the ellipse. The kurtosis of this vector is therefore taken as an estimate of the acceleration.

Note that kurtosis alone cannot be used as a measure, since it is obtained from a normalized vector (i.e. it is dimensionless). Thus, the average value per pixel, P, of image C is also computed and taken as an additional feature. Without it, any two frames could yield high kurtosis even without significant motion.

Deceleration was also considered as an additional feature; it can be obtained by reversing the order of consecutive frames and applying the same algorithm explained above.

For every short clip, we compute histograms of these features, so that acceleration/deceleration patterns can be used for discrimination.

Once we have features for every clip, for classification we use two different classifiers: the well-known K-Nearest Neighbours and a Support Vector Machine (SVM) with a linear kernel.

The method described requires training clips that contain fights. Thus, when fight sequences are given, they may have to be first evaluated for fight subsequences. The feature extraction and classification steps are sketched below.
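Below is an illustrative sketch of these steps, again an assumption rather than the authors' code. The projection axis taken over the sinogram, the histogram bin count and fixed value ranges, and the classifier settings are all placeholder choices the paper does not specify.

```python
import numpy as np
from scipy.stats import kurtosis
from skimage.transform import radon
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def acceleration_features(C):
    """Kurtosis of the normalized Radon max-projection, plus the mean of C."""
    sinogram = radon(C, theta=np.arange(180.0), circle=False)
    proj = sinogram.max(axis=0)      # maximum projection, one value per angle
    proj = proj / proj.max()         # normalize to maximum value 1
    return kurtosis(proj), C.mean()  # peakedness cue + per-pixel average of C

def clip_descriptor(ratio_images, bins=10,
                    k_range=(0.0, 50.0), m_range=(0.0, 10.0)):
    """Histogram the per-pair features over a 50-frame clip.

    Fixed ranges (assumed values) keep descriptors comparable across clips.
    """
    k, m = zip(*(acceleration_features(C) for C in ratio_images))
    hk, _ = np.histogram(k, bins=bins, range=k_range)
    hm, _ = np.histogram(m, bins=bins, range=m_range)
    return np.concatenate([hk, hm]).astype(float)

def train_classifiers(X, y):
    """Fit the two classifiers named in the text: KNN and a linear-kernel SVM."""
    knn = KNeighborsClassifier().fit(X, y)
    svm = SVC(kernel="linear").fit(X, y)
    return knn, svm
```

Usage would follow the clip pipeline directly: compute C for each consecutive frame pair in a 50-frame clip, stack the clip descriptors into a matrix X with fight / non-fight labels y, and train both classifiers.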
3. EXPERIMENT
Sometimes, non-fight clips may appear within a violent segment. For each violent segment (as provided by the MediaEval organizers), we manually removed clips without fighting action. Moreover, random clips of 50 consecutive frames, taken outside the violent segments, were selected for the non-fight class. We trained the two classifiers for the fight concept and then applied them to every 50 consecutive frames of the test set. We submitted 2 runs; the performance details are as follows. Table 1 reports AED [4] precision, recall and F-measure values, whereas Table 2 shows the evaluation results for the submitted runs, MAP at 20 and at 100.

Table 1: AED precision, recall and F-measure at video shot level

Run                   AED-P    AED-R    AED-F
Run1-classifier-knn   0.1178   0.6265   0.1982
Run2-classifier-svm   0.1440   0.4482   0.1700

Table 2: Mean Average Precision (MAP) values at 20 and 100

Run                   MAP at 20   MAP at 100
Run1-classifier-knn   0.1475      0.1343
Run2-classifier-svm   0.1350      0.1498

4. CONCLUSIONS
Based on the observation that kinematic information may suffice for human perception of various actions, in this work a novel fight detection method is proposed which uses extreme acceleration patterns as the main discriminating feature. In experiments with other datasets we obtained accuracies above 90% and processing times of a few milliseconds per frame. The results on the MediaEval dataset are, however, very poor. We suppose this could be due to the test ground truth (used by the organizers to obtain the performance measures). The category 'violence' is more general, since it includes violent scenes that may contain explosions, shots, car chases, fights, etc. Although we had 'fight'-labelled training videos, that label was not available for the test videos. What is more, the definition of 'violence' in MediaEval is "physical violence or accident resulting in human injury or pain", so we were able to detect only part of the violence: fights.

On the other hand, there are a number of practical aspects that are not taken into account. In many surveillance scenarios, for example, we do not have access to color images or audio, and the typical forms of violence are fights and vandalism rather than explosions, car chases, etc. The processing power needed to run detection algorithms is also an important issue in those applications.

5. ACKNOWLEDGMENTS
This work has been supported by Project TIN2011-24367 from Spain's Ministerio de Economía y Competitividad. The authors also thank the MediaEval organizers for inviting us to participate.

6. REFERENCES
[1] G. Castellano, S. Villalba, and A. Camurri. Recognising human emotions from body movement and gesture dynamics. Affective Computing and Intelligent Interaction, 4738(1):17–82, 2007.
[2] T. J. Clarke, M. F. Bradshaw, D. T. Field, S. E. Hampson, and D. Rose. The perception of emotion from body movement in point-light displays of interpersonal dialogue. Perception, 34(1):1171–1180, 2005.
[3] A. Datta, M. Shah, and N. D. V. Lobo. Person-on-person violence detection in video data. In Proceedings of the 16th International Conference on Pattern Recognition, pages 433–438, 2002.
[4] C. H. Demarty, C. Penet, M. Schedl, B. Ionescu, V. Quang, and Y. G. Jiang. The 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
[5] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
[6] M. Saerbeck and C. Bartneck. Perception of affect elicited by robot motion. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, pages 53–60, 2010.