=Paper=
{{Paper
|id=Vol-1263/paper82
|storemode=property
|title=MTM at MediaEval 2014 Violence Detection
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_82.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Teixeira14
}}
==MTM at MediaEval 2014 Violence Detection==
MTM at MediaEval 2014 Violence Detection Task
Bruno do Nascimento Teixeira
Universidade Federal de Minas Gerais
Belo Horizonte, Brazil
bruno.texeira@dcc.ufmg.br
ABSTRACT
This paper describes the participation of team MTM in the Violent Scenes Detection (VSD) task of the MediaEval 2014 campaign. We propose an approach to the problem of detecting violence based on probabilistic graphical models, using Mel-frequency cepstral coefficients (MFCCs) as the audio feature. In our approach, we employ Dynamic Bayesian Networks (DBNs) to represent a violent scene as a dynamic system.
1. INTRODUCTION
The goal of the Violent Scenes Detection (VSD) task of the MediaEval 2014 benchmarking campaign is to detect violence in movies [5]. This year the organizers of the VSD task released two datasets: (i) a set of 31 Hollywood movies, of which 24 are used for training and 7 for testing (our focus); and (ii) a YouTube set composed of 86 violent and non-violent videos. Violence is defined as what "one would not let an 8 years old child see in a movie because it contains physical violence". A model based on the variable-duration hidden Markov model has been proposed to detect complex events in Internet videos using latent variables [6]. The authors of [1] propose an audio-visual approach to video genre classification using content descriptors that exploit audio, color, temporal, and contour information, and demonstrate good results over existing approaches by combining these descriptors for genre classification. In [2], the temporal structure of broadcast tennis video is recovered with HMMs; the trained HMM is used to analyze the temporal interleaving of shots.
We propose to model video based on its temporal structure and the principle of causality, using Dynamic Bayesian Networks (DBNs).
2. METHOD
For this year's benchmark, we have developed an acoustic system based on temporal data (MFCC vectors). The main idea behind this approach is to represent a violent scene as a dynamic system.
2.1 Dynamic Bayesian Network
Figure 1: A graphical-model view of a DBN unrolled for 4 slices, with hidden state sequence U and observed nodes X.
A DBN (see Figure 1) is a state-space model of a random variable V_t [3]:

V_t = (U_t, X_t, Y_t),    (1)

where U_t represents the hidden variable, X_t the input, and Y_t the output. A pair (B_1, B_2) defines a DBN, where B_1 and B_2 are BNs. The two-slice temporal Bayes net B_2 (the DBN unrolled for 2 slices) defines P(V_t | V_{t-1}):

P(V_t | V_{t-1}) = \prod_{i=1}^{N} P(V_{it} | Pa(V_{it})),    (2)

where Pa(V_{it}) are the parents of V_{it} in the net. Next, our acoustic feature detector is described.
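To make Eq. (2) concrete, the following minimal Python sketch evaluates the two-slice factorization for a small naive network in which each node depends only on its own value at the previous slice. The node count, binary state space, and random transition tables are illustrative assumptions, not the model trained in the paper.

```python
import numpy as np

# A minimal sketch of Eq. (2) for a "naive" two-slice network in which
# each node V_i at slice t depends only on its own value at slice t-1,
# i.e. Pa(V_it) = {V_i,t-1}. Node count, binary states, and random
# transition tables are illustrative assumptions.

rng = np.random.default_rng(0)
N = 3  # number of nodes per slice

# One 2x2 table per node: trans[i][a, b] = P(V_it = b | V_i,t-1 = a)
trans = rng.dirichlet(alpha=[1.0, 1.0], size=(N, 2))  # shape (N, 2, 2)

def two_slice_prob(v_prev, v_curr):
    """P(V_t | V_t-1) as the product over nodes of P(V_it | Pa(V_it))."""
    p = 1.0
    for i in range(N):
        p *= trans[i][v_prev[i], v_curr[i]]
    return p

print(two_slice_prob([0, 1, 0], [1, 1, 0]))
```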
2.2 Acoustic Feature Detector
Our audio concept detector is based on MFCCs. The audio signal is segmented into overlapping acoustic frames; each acoustic frame groups samples using a window of fixed length. We split the audio signal into frames of 40 ms length, with 20 ms overlap, and apply a Hamming window to each frame. The Hamming function is given by:

w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right).    (3)

For each audio frame, 12 MFCCs (range 133 Hz-6855 Hz) and their first and second derivatives are computed to build an acoustic vector y^j:

y^j = (y^j_1, y^j_2, \ldots, y^j_{36}).    (4)
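As a sketch of this extraction step (40 ms frames, 20 ms hop, Hamming window, 12 MFCCs plus first- and second-order derivatives, giving 36 dimensions per frame), one possible Python implementation uses librosa; the input file name and sample rate are placeholder assumptions, and librosa is not necessarily the tooling used in the paper.

```python
import numpy as np
import librosa

# Sketch of the feature extraction described above: 40 ms frames with a
# 20 ms hop (i.e. 20 ms overlap), Hamming window, 12 MFCCs plus first
# and second derivatives -> 36 dimensions per frame. File name and
# sample rate are placeholders.

y, sr = librosa.load("movie_audio.wav", sr=16000)  # hypothetical input
n_fft = int(0.040 * sr)   # 40 ms frame length
hop = int(0.020 * sr)     # 20 ms hop

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=12,
    n_fft=n_fft, hop_length=hop,
    window="hamming", fmin=133, fmax=6855,
)
d1 = librosa.feature.delta(mfcc, order=1)   # first derivative
d2 = librosa.feature.delta(mfcc, order=2)   # second derivative

features = np.vstack([mfcc, d1, d2]).T      # shape: (num_frames, 36)
print(features.shape)
```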
2.3 Bag of Audio Words representation
After feature extraction, one way of representing audio is through a feature vector model using Bag of Audio Words (BoAW). In this representation, each vector has the size of the vocabulary, and each vocabulary word corresponds to a position in the vector. The ith vector value for an audio segment equals the number of occurrences of word i in that segment.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain
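A minimal Python sketch of this BoAW step follows: per-frame feature vectors are quantized against a learned codebook and one histogram is built per segment. The codebook size (128) matches run #2; the training data and segmentation below are stand-in assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal BoAW sketch: quantize per-frame feature vectors against a
# learned codebook and count word occurrences per temporal segment.
# Training frames and segment frames are random stand-ins.

rng = np.random.default_rng(0)
train_frames = rng.normal(size=(5000, 36))  # stand-in for training MFCC vectors

codebook = KMeans(n_clusters=128, n_init=10, random_state=0).fit(train_frames)

def boaw_histogram(segment_frames, vocab_size=128):
    """ith entry = number of occurrences of audio word i in the segment."""
    words = codebook.predict(segment_frames)
    return np.bincount(words, minlength=vocab_size)

segment = rng.normal(size=(200, 36))        # stand-in for one segment's frames
print(boaw_histogram(segment).shape)        # (128,)
```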
Table 1: Performance of DBNs for the violence detection task at MediaEval 2014. Each run reports Mean Average Precision (MAP), MAP2014, and MAP at 100 (MAP@100) per test movie.

                        run #1 DBN                     run #2 DBN BoAW
Source                  MAP     MAP2014  MAP@100       MAP     MAP2014  MAP@100
8 MILE                  0.0000  0.0000   0.0000        0.0000  0.0000   0.0000
BRAVEHEART              0.0429  0.0029   0.0369        0.0572  0.0149   0.2977
DESPERADO               0.1875  0.0159   0.1407        0.2165  0.0173   0.1635
GHOST IN THE SHELL      0.1018  0.0125   0.0458        0.1401  0.0423   0.1970
JUMANJI                 0.0480  0.0235   0.1000        0.0443  0.0180   0.0307
TERMINATOR 2            0.1974  0.0518   0.1993        0.1113  0.0133   0.0295
V FOR VENDETTA          0.1201  0.0364   0.1432        0.0985  0.0794   0.4311
Figure 2: Given a video, we split it into segments and build a BoAW histogram for each segment.

3. SUBMITTED RUNS
For each run, a naive DBN is trained using two different observed vectors Y_t: (i) the acoustic vector y^j, and (ii) the BoAW vector built from y^j with 128 audio words (see Figure 2). The likelihood of a model M, P(y_{1:T} | M), is used to assign a sequence y_{1:T} a non-violent or violent label as follows:

M^*(y_{1:T}) = \arg\max_M P(y_{1:T} | M) P(M).    (5)
The Bayes Net Toolbox for Matlab (BNT) [4] is used to train the dynamic networks.
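The paper trains its DBNs with BNT in Matlab; as a language-agnostic illustration of the decision rule in Eq. (5), the Python sketch below scores a sequence under a "violent" and a "non-violent" model and picks the argmax of log-likelihood plus log-prior. The discrete HMMs and their random parameters are simplified stand-ins, not the trained networks.

```python
import numpy as np

# Illustration of Eq. (5): M* = argmax_M P(y_1:T | M) P(M).
# The discrete-emission HMMs below are simplified stand-ins for the
# paper's DBNs; parameters are random for demonstration only.

def log_likelihood(obs, pi, A, B):
    """Forward algorithm in log space for a discrete-emission HMM.
    pi: (S,) initial probs, A: (S,S) transitions, B: (S,V) emissions."""
    log_alpha = np.log(pi) + np.log(B[:, obs[0]])
    for o in obs[1:]:
        log_alpha = (
            np.logaddexp.reduce(log_alpha[:, None] + np.log(A), axis=0)
            + np.log(B[:, o])
        )
    return np.logaddexp.reduce(log_alpha)

def classify(obs, models, log_priors):
    scores = {m: log_likelihood(obs, *params) + log_priors[m]
              for m, params in models.items()}
    return max(scores, key=scores.get)

# Toy parameters: 2 hidden states, 128-word BoAW-style vocabulary.
rng = np.random.default_rng(0)
def random_hmm():
    pi = rng.dirichlet(np.ones(2))
    A = rng.dirichlet(np.ones(2), size=2)
    B = rng.dirichlet(np.ones(128), size=2)
    return pi, A, B

models = {"violent": random_hmm(), "non-violent": random_hmm()}
log_priors = {"violent": np.log(0.5), "non-violent": np.log(0.5)}
obs = rng.integers(0, 128, size=50)  # stand-in observation sequence
print(classify(obs, models, log_priors))
```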
Table 2: Global results for the violence detection task at MediaEval 2014.

Run                     MAP@100   MAP2014
#1 (MFCC-DBN)           9.51 %    2.04 %
#2 (MFCC-BoAW-DBN)      16.51 %   2.64 %
4. RESULTS AND DISCUSSION
Table 1 shows the Mean Average Precision (MAP), MAP2014, and MAP@100 for the test movies. The DBN with BoAW and the DBN without it perform similarly. Both approaches (run #1 and run #2) fail to detect the violent scenes in the movie "8 Mile". The run #2 results are higher for the movies "BRAVEHEART", "DESPERADO", "GHOST IN THE SHELL", and "V FOR VENDETTA", but lower for the movies "TERMINATOR 2" and "JUMANJI", in comparison with run #1 (using the MAP@100 and MAP2014 metrics). Run #2 uses the BoAW representation, which has fewer observations (temporal segments) than run #1, which uses the acoustic feature vector built directly from the MFCCs. Our best result is 16.51% (MAP@100) or 2.64% (MAP2014) for run #2 (see Table 2). We investigated the results and presume that BoAW removes noisy observations while reducing the number of observations per segment. This may be related to the observation "grouping" performed when the BoAW is computed over a temporal segment (see Figure 2). Thus, BoAW removes data noise and builds a better representation of a scene (the model observation). However, the results are still very poor. We suppose this could be due to the features: MFCCs alone do not seem capable of distinguishing all violent from non-violent segments and of generalizing the violence concept. Further work will aim to capture the causality in violent segments using different structures and other feature modalities (feature selection).
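For reference, a rough sketch of the MAP@100 metric follows, under the assumption that it is standard average precision computed over the top 100 ranked shots and then averaged over movies; this is our reading of the metric, not the official MediaEval evaluation code.

```python
import numpy as np

# Rough sketch of AP@100, assuming standard average precision over the
# top 100 ranked shots; normalization by the number of relevant shots
# in the top k is one common convention. Not the official evaluation.

def ap_at_k(ranked_labels, k=100):
    """ranked_labels: 1/0 relevance of shots, sorted by decreasing score."""
    top = np.asarray(ranked_labels[:k], dtype=float)
    if top.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(top) / (np.arange(len(top)) + 1)
    return float((precision_at_i * top).sum() / top.sum())

# MAP@100 would then be the mean of ap_at_k over the test movies.
print(ap_at_k([1, 0, 1, 1, 0, 0, 1]))
```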
rics). Run #2 uses BoAW representation, that has less observations Society.
(temporal segments) than run #1 approach, which uses directly
the acoustic feature vector built from MFCCs. Our best result is
16.51% (MAP@100) or 2.64 % (MAP2014 ) for run #2 (see Table