=Paper=
{{Paper
|id=None
|storemode=property
|title=VISILAB at MediaEval 2013: Fight Detection
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_11.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SerranoDB13
}}
==VISILAB at MediaEval 2013: Fight Detection==
Ismael Serrano, Oscar Déniz, Gloria Bueno
VISILAB group, University of Castilla-La Mancha
E.T.S.I.Industriales, Avda. Camilo José Cela 3,
13071 Ciudad Real, Spain.
Ismael.Serrano@uclm.es, Oscar.Deniz@uclm.es, Gloria.Bueno@uclm.es
ABSTRACT
Fight detection from video is a task with direct application in surveillance scenarios such as prison cells, yards, mental institutions, etc. Work on this task is growing, although only a few datasets for fight detection currently exist. The Violent Scenes Detection task of the MediaEval initiative offers a practical challenge for detecting violent video clips in movies. In this working notes paper we briefly describe our method for fight detection, which has been used to detect fights within the above-mentioned task. Inspired by results suggesting that kinematic features alone are discriminative for at least some actions, our method uses extreme acceleration patterns as its main feature. These extreme accelerations are efficiently estimated by applying the Radon transform to the power spectrum of consecutive frames.
Keywords
Action recognition, violence detection, fight detection
1. INTRODUCTION

In recent years, the problem of human action recognition from video has become tractable using computer vision techniques [5]. Despite its potential usefulness, the specific task of violent scene detection has been comparatively less studied. The annual MediaEval evaluation campaign introduced this specific problem in 2011. For an overview of this year's task please see [4].

Figure 1: Two consecutive frames in a fight clip from a movie. Note the blur on the left side of the second frame.

2. PROPOSED METHOD
The presence of large accelerations is key in the task of fight detection ([6], [2] and [1]). In this context, body part tracking can be considered, as in [3], which introduced the so-called Acceleration Measure Vectors (AMV). In general, acceleration can be inferred from tracked point trajectories. However, extreme acceleration implies image blur (see for example Figure 1), which makes tracking less precise or even impossible.

Motion blur entails a shift in image content towards low frequencies. This behaviour allows building an efficient acceleration estimator for video. Our proposed method works with sequences of 50 frames; therefore we need to divide the shots into 50-frame clips. First, we compute the power spectrum of two consecutive frames. It can be shown that, when there is sudden motion between two consecutive frames, the power spectrum image will depict an ellipse. The orientation of the ellipse is perpendicular to the motion direction, the frequencies outside the ellipse being attenuated. Most importantly, the eccentricity of this ellipse depends on the acceleration. Basically, the proposed method aims at detecting the sudden appearance of such an ellipse.

Our objective is then to detect this ellipse and estimate its eccentricity, which represents the magnitude of the acceleration. Ellipse detection can be reliably performed using the Radon transform, which provides image projections along lines with different orientations.
For each pair of consecutive frames, we compute the power spectrum using the 2D Fast Fourier Transform (in order to avoid edge effects, a Hanning window is applied before computing the FFT). Let us call these spectra images P<sub>i−1</sub> and P<sub>i</sub>. These images are divided, i.e. C = P<sub>i</sub>/P<sub>i−1</sub>.

When there is no change between the two frames, the power spectra will be equal and C will have a constant value. When motion has occurred, an ellipse will appear in C.
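As an illustration of this step, the sketch below computes the windowed power spectra and the ratio image C for a pair of grayscale frames. It is a minimal Python/NumPy reconstruction under our own assumptions: the function names, the spectrum centring with fftshift and the epsilon guard against division by zero are ours, not from the paper.

<pre>
import numpy as np

def power_spectrum(frame):
    # 2D Hanning window applied first to avoid edge effects, as described above
    win = np.hanning(frame.shape[0])[:, None] * np.hanning(frame.shape[1])[None, :]
    f = np.fft.fftshift(np.fft.fft2(frame * win))
    return np.abs(f) ** 2

def ratio_image(prev_frame, frame, eps=1e-8):
    # C = P_i / P_{i-1}: roughly constant when nothing moves,
    # elliptical low-frequency structure appears under sudden motion
    return power_spectrum(frame) / (power_spectrum(prev_frame) + eps)
</pre>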
Ellipse detection can be reliably performed using the Radon transform. After applying the Radon transform to image C, its vertical maximum projection vector is obtained and normalized (to a maximum value of 1). When there is an ellipse, this vector will show a sharp peak, representing the major axis of the ellipse. The kurtosis of this vector is therefore taken as an estimate of the acceleration.

Note that kurtosis alone cannot be used as a measure, since it is obtained from a normalized vector (i.e. it is dimensionless). Thus, the average value per pixel, P, of image C is also computed and taken as an additional feature. Without it, any two frames could lead to high kurtosis even without significant motion.
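A possible reading of these two features in Python, using scikit-image's Radon transform and SciPy's kurtosis; the angle sampling and the circle=False setting are our choices, as the paper does not specify them.

<pre>
import numpy as np
from scipy.stats import kurtosis
from skimage.transform import radon

def acceleration_features(C):
    theta = np.arange(180.0)                         # projection angles in degrees (our choice)
    sinogram = radon(C, theta=theta, circle=False)   # rows: ray offsets, columns: angles
    proj = sinogram.max(axis=0)                      # vertical maximum projection vector
    proj = proj / proj.max()                         # normalize to maximum value 1
    # kurtosis of the projection estimates the acceleration;
    # the mean of C is the dimensioned companion feature P
    return kurtosis(proj), C.mean()
</pre>

Taking the maximum along each column of the sinogram collapses the ellipse into a single sharp peak at the angle of its major axis, which is exactly the kind of peakedness that the kurtosis measures.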
Deceleration was also considered as an additional feature; it can be obtained by reversing the order of the two consecutive frames and applying the same algorithm explained above.

For every short clip, we compute histograms of these features, so that acceleration/deceleration patterns can be used for discrimination. Once we have features for every clip, we use two different classifiers: the well-known K-Nearest Neighbours and a Support Vector Machine (SVM) with a linear kernel.

The method described requires training clips that contain fights. Thus, when fight sequences are given, they may have to be first evaluated for fight subsequences.
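The per-clip descriptor and the two submitted runs could then be sketched as follows. The histogram bin count, value range and classifier hyperparameters are our assumptions (the paper does not report them), and deceleration values are obtained simply by swapping the frame order in ratio_image above.

<pre>
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def clip_descriptor(accel, decel, bins=16, rng=(0.0, 50.0)):
    # histogram the per-frame-pair acceleration/deceleration estimates
    # of one 50-frame clip into a fixed-length feature vector
    ha, _ = np.histogram(accel, bins=bins, range=rng)
    hd, _ = np.histogram(decel, bins=bins, range=rng)
    return np.concatenate([ha, hd]).astype(float)

# X: one descriptor per training clip; y: 1 = fight, 0 = non-fight
# (placeholder data standing in for the extracted features)
X, y = np.random.rand(100, 32), np.random.randint(0, 2, 100)
run1 = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # Run1-classifier-knn
run2 = SVC(kernel="linear").fit(X, y)                 # Run2-classifier-svm
</pre>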
3. EXPERIMENT

Sometimes, non-fight clips may appear within a violent segment. For each violent segment (as provided by the MediaEval organizers), we manually removed clips without fighting action. Moreover, random clips of 50 consecutive frames, taken outside the violent segments, were selected for the non-fight class. We trained the two classifiers for the fight concept and then applied them to every 50 consecutive frames of the test set.

We submitted 2 runs and the details of their performance are as follows. Table 1 reports AED [4] precision, AED recall and AED F-measure values, whereas Table 2 shows the evaluation results for the submitted runs, MAP at 20 and 100.

{| class="wikitable"
|+ Table 1: AED precision, recall and F-measure at video shot level
! Run !! AED-P !! AED-R !! AED-F
|-
| Run1-classifier-knn || 0.1178 || 0.6265 || 0.1982
|-
| Run2-classifier-svm || 0.1440 || 0.4482 || 0.1700
|}

{| class="wikitable"
|+ Table 2: Mean Average Precision (MAP) values at 20 and 100
! Run !! MAP at 20 !! MAP at 100
|-
| Run1-classifier-knn || 0.1475 || 0.1343
|-
| Run2-classifier-svm || 0.1350 || 0.1498
|}
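For reference, a plain implementation of average precision over the top-k ranked shots, which is our reading of the MAP at 20 / MAP at 100 measure; the official evaluation tool may differ.

<pre>
import numpy as np

def average_precision_at_k(ranked_labels, k):
    # ranked_labels: 1 for a correct (violent) shot, 0 otherwise,
    # sorted by decreasing classifier confidence
    labels = np.asarray(ranked_labels[:k], dtype=float)
    if labels.sum() == 0:
        return 0.0
    prec_at_i = np.cumsum(labels) / (np.arange(labels.size) + 1)
    return float((prec_at_i * labels).sum() / labels.sum())
</pre>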
4. CONCLUSIONS

Based on the observation that kinematic information may suffice for human perception of various actions, in this work a novel fight detection method has been proposed which uses extreme acceleration patterns as the main discriminating feature. In experiments with other datasets we obtained accuracies above 90% and processing times of a few milliseconds per frame. The results on the MediaEval dataset are, however, very poor. We suppose this could be due to the test ground truth (used by the organizers to obtain the performance measures). The category 'violence' is more general, because it includes violent scenes such as explosions, gunshots, car chases, fights, etc. Although we had training videos labelled 'fight', that label was not available for the test videos. What is more, the definition of 'violence' in MediaEval is "physical violence or accident resulting in human injury or pain", so we were able to detect only part of the violence: fights.

On the other hand, there are a number of practical aspects that are not taken into account. In many surveillance scenarios, for example, we do not have access to colour images and audio, and the typical forms of violence are fights and vandalism rather than explosions, car chases, etc. The processing power needed to run detection algorithms is also an important issue in those applications.

5. ACKNOWLEDGMENTS

This work has been supported by Project TIN2011-24367 from Spain's Ministerio de Economía y Competitividad. The authors also thank the MediaEval organizers for inviting us to participate.

6. REFERENCES

[1] G. Castellano, S. Villalba, and A. Camurri. Recognising human emotions from body movement and gesture dynamics. In Affective Computing and Intelligent Interaction, LNCS 4738, pages 71–82, 2007.
[2] T. J. Clarke, M. F. Bradshaw, D. T. Field, S. E. Hampson, and D. Rose. The perception of emotion from body movement in point-light displays of interpersonal dialogue. Perception, 34(10):1171–1180, 2005.
[3] A. Datta, M. Shah, and N. D. V. Lobo. Person-on-person violence detection in video data. In Proceedings of the 16th International Conference on Pattern Recognition, volume 1, pages 433–438, 2002.
[4] C. H. Demarty, C. Penet, M. Schedl, B. Ionescu, V. Quang, and Y. G. Jiang. The 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[5] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
[6] M. Saerbeck and C. Bartneck. Perception of affect elicited by robot motion. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, pages 53–60, 2010.