VISILAB at MediaEval 2013: Fight Detection

Ismael Serrano, Oscar Déniz, Gloria Bueno
VISILAB group, University of Castilla-La Mancha
E.T.S.I. Industriales, Avda. Camilo José Cela 3, 13071 Ciudad Real, Spain.
Ismael.Serrano@uclm.es, Oscar.Deniz@uclm.es, Gloria.Bueno@uclm.es

ABSTRACT
Fight detection from video is a task with direct application in surveillance scenarios such as prison cells, yards, mental institutions, etc. Work on this task is growing, although only a few datasets for fight detection currently exist. The Violent Scene Detection task of the MediaEval initiative offers a practical challenge for detecting violent video clips in movies. In this working notes paper we briefly describe our method for fight detection, which has been used to detect fights within the above-mentioned Violent Scene Detection task. Inspired by results suggesting that kinematic features alone are discriminative for at least some actions, our method uses extreme acceleration patterns as its main feature. These extreme accelerations are efficiently estimated by applying the Radon transform to the power spectrum of consecutive frames.

Keywords
Action recognition, violence detection, fight detection

1. INTRODUCTION
In recent years, the problem of human action recognition from video has become tractable using computer vision techniques [5]. Despite its potential usefulness, the specific task of violent scene detection has been comparatively less studied. The annual MediaEval evaluation campaign introduced this specific problem in 2011. For an overview of this year's task, please see [4].

[Figure 1: Two consecutive frames in a fight clip from a movie. Note the blur on the left side of the second frame.]

2. PROPOSED METHOD
The presence of large accelerations is key in the task of fight detection ([6], [2] and [1]). In this context, body part tracking can be considered, as in [3], which introduced the so-called Acceleration Measure Vectors (AMV). In general, acceleration can be inferred from tracked point trajectories. However, extreme acceleration implies image blur (see for example Figure 1), which makes tracking less precise or even impossible.

Motion blur entails a shift in image content towards low frequencies. This behaviour allows building an efficient acceleration estimator for video. Our proposed method works with sequences of 50 frames; therefore we divide the shots into 50-frame clips. First, we compute the power spectrum of two consecutive frames. It can be shown that, when there is a sudden motion between two consecutive frames, the power spectrum image will depict an ellipse. The orientation of the ellipse is perpendicular to the motion direction, the frequencies outside the ellipse being attenuated. Most importantly, the eccentricity of this ellipse depends on the acceleration. Basically, the proposed method aims at detecting the sudden appearance of such an ellipse. Our objective is then to detect this ellipse and estimate its eccentricity, which represents the magnitude of the acceleration. Ellipse detection can be reliably performed using the Radon transform, which provides image projections along lines with different orientations.

For each pair of consecutive frames, we compute the power spectrum using the 2D Fast Fourier Transform (in order to avoid edge effects, a Hanning window is applied before computing the FFT). Let us call these spectra images P_{i-1} and P_i. These images are divided, i.e. C = P_i / P_{i-1}. When there is no change between the two frames, the power spectra will be equal and C will have a constant value. When motion has occurred, an ellipse will appear in C. A sketch of this step is given below.
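The following is a minimal sketch of this spectrum-ratio step, not the authors' released code. The function name, the requirement of grayscale float frames, and the epsilon guard against division by zero (which the paper does not specify) are assumptions.

```python
import numpy as np

def spectrum_ratio(frame_prev, frame_curr, eps=1e-8):
    """Return C = P_i / P_{i-1}, the ratio of power spectra of two frames."""
    # Hanning window to reduce edge effects before the FFT, as described above.
    h, w = frame_prev.shape
    window = np.outer(np.hanning(h), np.hanning(w))
    # Power spectrum = squared magnitude of the centered 2D FFT.
    p_prev = np.abs(np.fft.fftshift(np.fft.fft2(frame_prev * window))) ** 2
    p_curr = np.abs(np.fft.fftshift(np.fft.fft2(frame_curr * window))) ** 2
    # Pixel-wise division; eps is an assumed guard against zero-valued bins.
    return p_curr / (p_prev + eps)
```

A design consequence of the ratio is that static image content, which contributes equally to both spectra, cancels out, so C mainly reflects the change between the two frames.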
After applying the Radon transform to image C, its vertical maximum projection vector is obtained and normalized (to maximum value 1). When there is an ellipse, this vector shows a sharp peak, representing the major axis of the ellipse. The kurtosis of this vector is therefore taken as an estimate of the acceleration.

Note that kurtosis alone cannot be used as a measure, since it is obtained from a normalized vector (i.e. it is dimensionless). Thus, the average value per pixel, P, of image C is also computed and taken as an additional feature. Without it, any two frames could yield high kurtosis even without significant motion.

Deceleration was also considered as an additional feature; it can be obtained by reversing the order of consecutive frames and applying the same algorithm explained above.

For every short clip, we compute histograms of these features, so that acceleration/deceleration patterns can be used for discrimination.

Once we have features for every clip, for classification we use two different classifiers: the well-known K-Nearest Neighbours and a Support Vector Machine (SVM) with a linear kernel.

The method described requires training clips that contain fights. Thus, when fight sequences are given, they may have to be first evaluated for fight subsequences. The feature extraction and classification steps are sketched below.
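Below is an illustrative sketch of these steps, again an assumption rather than the authors' code. The projection axis taken over the sinogram, the histogram bin count and fixed value ranges, and the classifier settings are all placeholder choices the paper does not specify.

```python
import numpy as np
from scipy.stats import kurtosis
from skimage.transform import radon
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def acceleration_features(C):
    """Kurtosis of the normalized Radon max-projection, plus the mean of C."""
    sinogram = radon(C, theta=np.arange(180.0), circle=False)
    proj = sinogram.max(axis=0)      # maximum projection, one value per angle
    proj = proj / proj.max()         # normalize to maximum value 1
    return kurtosis(proj), C.mean()  # peakedness cue + per-pixel average of C

def clip_descriptor(ratio_images, bins=10,
                    k_range=(0.0, 50.0), m_range=(0.0, 10.0)):
    """Histogram the per-pair features over a 50-frame clip.

    Fixed ranges (assumed values) keep descriptors comparable across clips.
    """
    k, m = zip(*(acceleration_features(C) for C in ratio_images))
    hk, _ = np.histogram(k, bins=bins, range=k_range)
    hm, _ = np.histogram(m, bins=bins, range=m_range)
    return np.concatenate([hk, hm]).astype(float)

def train_classifiers(X, y):
    """Fit the two classifiers named in the text: KNN and a linear-kernel SVM."""
    knn = KNeighborsClassifier().fit(X, y)
    svm = SVC(kernel="linear").fit(X, y)
    return knn, svm
```

Usage would follow the clip pipeline directly: compute C for each consecutive frame pair in a 50-frame clip, stack the clip descriptors into a matrix X with fight / non-fight labels y, and train both classifiers.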
3. EXPERIMENT
Sometimes, non-fight clips may appear within a violent segment. For each violent segment (as provided by the MediaEval organizers), we manually removed clips without fighting action. Moreover, random clips of 50 consecutive frames, taken outside the violent segments, were selected for the non-fight class. We trained the two classifiers for the fight concept and then applied them to every 50 consecutive frames of the test set. We submitted 2 runs; the performance details are as follows. Table 1 reports AED [4] precision, recall and F-measure values, whereas Table 2 shows the evaluation results for the submitted runs, MAP at 20 and at 100.

Table 1: AED precision, recall and F-measure at video shot level

Run                   AED-P    AED-R    AED-F
Run1-classifier-knn   0.1178   0.6265   0.1982
Run2-classifier-svm   0.1440   0.4482   0.1700

Table 2: Mean Average Precision (MAP) values at 20 and 100

Run                   MAP at 20   MAP at 100
Run1-classifier-knn   0.1475      0.1343
Run2-classifier-svm   0.1350      0.1498

4. CONCLUSIONS
Based on the observation that kinematic information may suffice for human perception of various actions, in this work a novel fight detection method is proposed which uses extreme acceleration patterns as the main discriminating feature. In experiments with other datasets we obtained accuracies above 90% and processing times of a few milliseconds per frame. The results on the MediaEval dataset are, however, very poor. We suppose this could be due to the test ground truth (used by the organizers to obtain the performance measures). The category 'violence' is more general, since it includes violent scenes that may contain explosions, shots, car chases, fights, etc. Although we had 'fight'-labelled training videos, that label was not available for the test videos. What is more, the definition of 'violence' in MediaEval is "physical violence or accident resulting in human injury or pain", so we were able to detect only part of the violence: fights.

On the other hand, there are a number of practical aspects that are not taken into account. In many surveillance scenarios, for example, we do not have access to color images or audio, and the typical forms of violence are fights and vandalism rather than explosions, car chases, etc. The processing power needed to run detection algorithms is also an important issue in those applications.

5. ACKNOWLEDGMENTS
This work has been supported by Project TIN2011-24367 from Spain's Ministerio de Economía y Competitividad. The authors also thank the MediaEval organizers for inviting us to participate.

6. REFERENCES
[1] G. Castellano, S. Villalba, and A. Camurri. Recognising human emotions from body movement and gesture dynamics. Affective Computing and Intelligent Interaction, 4738(1):17–82, 2007.
[2] T. J. Clarke, M. F. Bradshaw, D. T. Field, S. E. Hampson, and D. Rose. The perception of emotion from body movement in point-light displays of interpersonal dialogue. Perception, 34(1):1171–1180, 2005.
[3] A. Datta, M. Shah, and N. D. V. Lobo. Person-on-person violence detection in video data. In Proceedings of the 16th International Conference on Pattern Recognition, pages 433–438, 2002.
[4] C. H. Demarty, C. Penet, M. Schedl, B. Ionescu, V. Quang, and Y. G. Jiang. The 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
[5] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
[6] M. Saerbeck and C. Bartneck. Perception of affect elicited by robot motion. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, pages 53–60, 2010.