MTM at MediaEval 2014 Violence Detection Task

Bruno do Nascimento Teixeira, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil


This paper describes the participation of team MTM in the Violent Scenes Detection (VSD) task of the MediaEval 2014 campaign. We propose an approach to the problem of detecting violence that is based on probabilistic graphical models and uses Mel-frequency cepstral coefficients (MFCCs) as audio features. In our approach, we employ Dynamic Bayesian Networks (DBNs) to represent a violent scene as a dynamic system.

INTRODUCTION

The goal of the Violent Scenes Detection (VSD) task of the MediaEval 2014 benchmarking campaign is to detect violence in movies [5]. This year the organizers of the VSD task released two datasets: (i) a set of 31 Hollywood movies, of which 24 are used for training and 7 for testing (our focus); and (ii) a YouTube set composed of 86 violent and non-violent videos. Violent content is defined as content that "one would not let an 8 years old child see in a movie because it contains physical violence".

A model based on the variable-duration hidden Markov model has been proposed to detect complex events in Internet videos using latent variables [6]. The authors of [1] propose an audio-visual approach to video genre classification using content descriptors that exploit audio, color, temporal, and contour information; by combining these descriptors, they report good results compared with existing approaches. In [2], the temporal structure of broadcast tennis video is recovered using hidden Markov models (HMMs), and the trained HMM is then used to analyze the temporal interleaving of shots.

We propose to model video based on its temporal structure and the principle of causality using Dynamic Bayesian Networks (DBNs).

METHOD

For this year's benchmark, we have developed an acoustic system based on temporal data (MFCC vector). The main idea behind this approach is to represent a violent scene as a dynamic system.

Dynamic Bayesian Network

A DBN (see Figure 1) is a state-space model of a random variable $V_t$ [3]:

$$V_t = (U_t, X_t, Y_t), \tag{1}$$

where $U_t$ represents the hidden, $X_t$ the input and $Y_t$ the output variables. A pair $(B_1, B_2)$ defines a DBN, where $B_1$ and $B_2$ are Bayesian networks (BNs). The two-slice temporal Bayes net $B_2$ (the DBN unrolled for 2 slices) defines $P(V_t \mid V_{t-1})$:

$$P(V_t \mid V_{t-1}) = \prod_{i=1}^{N} P(V_t^i \mid Pa(V_t^i)), \tag{2}$$

where $V_t^i$ is the $i$-th node in slice $t$ and $Pa(V_t^i)$ are its parents in the net. Next, our acoustic feature detector is described.

Figure 1: A graphical-model view of a DBN unrolled for 4 slices ($i-2$ to $i+1$), with hidden state sequence $U$ and an observed node $X$.

MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

Acoustic Feature Detector

Our audio concept detector is based on MFCCs. The audio signal is segmented into overlapping acoustic frames, each grouping the samples that fall within a fixed-length window. We split the audio signal into frames of 40 ms length with 20 ms overlap and apply a Hamming window to each frame. The Hamming function is given by:

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \tag{3}$$

where $N$ is the window length in samples and $0 \le n \le N-1$.
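The framing and windowing step can be sketched as follows. This is a minimal NumPy illustration, not the code used for the submitted runs: the function name `frame_signal`, the 16 kHz sample rate and the random test signal are assumptions made only for this example. The 40 ms frame length, the 20 ms overlap and the window formula come from the text above.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=40.0, overlap_ms=20.0):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    Hypothetical helper: returns an array of shape (n_frames, frame_len).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    overlap_len = int(round(sample_rate * overlap_ms / 1000.0))
    hop_len = frame_len - overlap_len  # step between consecutive frame starts
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row k of idx selects samples [k * hop_len, k * hop_len + frame_len).
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # Hamming window of Eq. (3): w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)).
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (frame_len - 1))
    return frames * window

# Assumed example input: 1 second of noise sampled at 16 kHz.
sample_rate = 16000
signal = np.random.default_rng(0).standard_normal(sample_rate)
frames = frame_signal(signal, sample_rate)
print(frames.shape)
```

With these assumed settings each frame holds 640 samples and consecutive frames start 320 samples apart. Trailing samples that do not fill a complete frame are dropped in this sketch; padding them instead would be an equally reasonable choice.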

For each audio frame, 12 MFCCs (range 133 Hz-6855 Hz) and their first and second derivatives are computed to build a 36-dimensional acoustic vector $y_j$:

$$y_j = (y_j^1, y_j^2, \ldots, y_j^{36}). \tag{4}$$
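To illustrate how 12 coefficients become a 36-dimensional vector, the sketch below stacks the static coefficients with their first and second time derivatives. The text does not say how the derivatives were computed, so the finite differences of `numpy.gradient` are an assumption here, and `add_deltas` is a hypothetical helper name. The MFCC matrix itself is taken as given.

```python
import numpy as np

def add_deltas(mfcc):
    """Stack static MFCCs with first and second time derivatives.

    mfcc: array of shape (n_frames, 12), with n_frames >= 2.
    Returns an array of shape (n_frames, 36).
    Assumption: derivatives are approximated by finite differences
    along the frame axis; other delta schemes are common as well.
    """
    mfcc = np.asarray(mfcc, dtype=float)
    delta = np.gradient(mfcc, axis=0)    # first derivative over time
    delta2 = np.gradient(delta, axis=0)  # second derivative over time
    return np.hstack([mfcc, delta, delta2])

# Assumed toy input: each coefficient grows by 1 per frame.
toy_mfcc = np.tile(np.arange(5.0)[:, None], (1, 12))
vectors = add_deltas(toy_mfcc)
print(vectors.shape)
```

Each row of `vectors` plays the role of one acoustic vector $y_j$ in Eq. (4): columns 0-11 hold the static coefficients, 12-23 the first derivatives and 24-35 the second derivatives.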

Bag of Audio Words representation

After feature extraction, the audio can be represented with a feature vector model using Bag of Audio Words (BoAW). In this representation, each vector has the size of the vocabulary, and each vocabulary word corresponds to one position in the vector. The i-th value of the vector for an audio segment equals the number of occurrences of word i in that segment.
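The counting step described above can be sketched as follows. The text does not state how the vocabulary is built or how frames are mapped to words, so this example assumes a given codebook (for instance cluster centroids) and nearest-centroid assignment. The function name and the toy data are hypothetical.

```python
import numpy as np

def bag_of_audio_words(features, codebook):
    """Build a BoAW histogram for one audio segment.

    features: (n_frames, d) acoustic vectors of the segment.
    codebook: (n_words, d) vocabulary; assumed to be given.
    Returns an integer vector of length n_words whose i-th entry is the
    number of frames assigned to word i.
    """
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every frame and every word.
    dist2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dist2.argmin(axis=1)  # nearest word per frame (an assumption)
    return np.bincount(words, minlength=len(codebook))

# Toy example with a 3-word vocabulary in 2 dimensions.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, -10.0]])
segment = np.array([[0.1, 0.0], [9.0, 10.0], [0.0, 0.2], [11.0, 9.0]])
histogram = bag_of_audio_words(segment, codebook)
print(histogram)  # [2 2 0]
```

The histogram always has the size of the vocabulary and its entries sum to the number of frames in the segment, so a whole segment collapses into a single observation.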

SUBMITTED RUNS

Each run trains a naive DBN with a different observed vector $Y_t$: (i) the acoustic vector $y_j$ (run #1), and (ii) the BoAW vector $b_{y_j}$ with 128 audio words (run #2, see Figure 2). The likelihood of a model $M$, $P(y_{1:T} \mid M)$, is used to assign a sequence $y_{1:T}$ to the non-violent or violent label as follows:

$$M^*(y_{1:T}) = \arg\max_M P(y_{1:T} \mid M)\, P(M). \tag{5}$$

The Bayes Net Toolbox for Matlab (BNT) [4] is used to train the dynamic networks.
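The decision rule of this section can be illustrated with a small stand-in model. The sketch below is not BNT and not the networks trained for the runs: it assumes a discrete-observation hidden Markov model (the simplest DBN with one hidden and one observed node per slice), evaluates $P(y_{1:T} \mid M)$ with the scaled forward algorithm, and picks the model maximizing likelihood times prior. All parameters, names and priors are made up for the example.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(y_{1:T} | M) for a discrete HMM.

    pi: (S,) initial state distribution.
    A:  (S, S) transitions, A[i, j] = P(U_t = j | U_{t-1} = i).
    B:  (S, K) emissions,   B[i, k] = P(y_t = k | U_t = i).
    """
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()
        log_lik += np.log(scale)  # accumulate log of the scaling factors
        alpha = alpha / scale     # renormalize to avoid underflow
    return log_lik

def classify(obs, models, priors):
    """Return the label maximizing log P(y | M) + log P(M), plus all scores."""
    scores = {name: log_likelihood(obs, *params) + np.log(priors[name])
              for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Toy parameters (assumptions): 2 hidden states, 2 observation symbols.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B_non_violent = np.array([[0.9, 0.1], [0.8, 0.2]])  # mostly emits symbol 0
B_violent = np.array([[0.2, 0.8], [0.1, 0.9]])      # mostly emits symbol 1
models = {"non-violent": (pi, A, B_non_violent), "violent": (pi, A, B_violent)}
priors = {"non-violent": 0.5, "violent": 0.5}

label, scores = classify([1, 1, 0, 1, 1], models, priors)
print(label)  # violent
```

Working in the log domain keeps the product in Eq. (5) numerically stable for long sequences, since the raw likelihood shrinks exponentially with the sequence length.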

RESULTS AND DISCUSSION

Table 1 shows the Mean Average Precision (MAP) metrics, MAP2014 and MAP@100, for the test movies. DBN with BoAW and DBN without it have similar performances. Both approaches (run #1 and run #2) fail to detect violent scenes in the movie "8 Mile". Compared with run #1, run #2 scores higher on "BRAVEHEART", "DESPERADO", "GHOST IN THE SHELL" and "V FOR VENDETTA", but lower on "TERMINATOR 2" and "JUMANJI" (on both MAP@100 and MAP2014). Run #2 uses the BoAW representation, which has fewer observations (temporal segments) than run #1, which directly uses the acoustic feature vector built from MFCCs. Our best result is 16.51% (MAP@100) or 2.64% (MAP2014), obtained by run #2 (see Table 2).

We investigated the results and presume that BoAW removes noisy observations while reducing the number of observations per segment. This may be related to the "grouping" of observations when the BoAW is computed for a temporal segment (see Figure 2). Thus, BoAW removes noise from the data and builds a better representation of a scene (model observation). However, the results are still very poor. We suppose this is due to the features: MFCCs alone do not seem capable of distinguishing all violent and non-violent segments or of generalizing the concept of violence. Future work will focus on capturing the causality in violent segments using different network structures and other feature modalities (feature selection).

ACKNOWLEDGMENTS

This work was supported in part by two grants from CAPES and CNPq.

Table 1: Performance of DBNs for the violence detection task at MediaEval 2014. Columns give the Mean Average Precision (MAP), the Mean Average Precision 2014 (MAP2014) and the Mean Average Precision at 100 (MAP@100) for each run.

                      run #1 DBN                    run #2 DBN BoAW
Source                MAP      MAP2014  MAP@100     MAP      MAP2014  MAP@100
8 MILE                0.0000   0.0000   0.0000      0.0000   0.0000   0.0000
BRAVEHEART            0.0429   0.0029   0.0369      0.0572   0.0149   0.2977
DESPERADO             0.1875   0.0159   0.1407      0.2165   0.0173   0.1635
GHOST IN THE SHELL    0.1018   0.0125   0.0458      0.1401   0.0423   0.1970
JUMANJI               0.0480   0.0235   0.1000      0.0443   0.0180   0.0307
TERMINATOR 2          0.1974   0.0518   0.1993      0.1113   0.0133   0.0295
V FOR VENDETTA        0.1201   0.0364   0.1432      0.0985   0.0794   0.4311

Figure 2: Given a video, we split it into segments and build BoAW histograms for each segment.

Table 2: Global results for the violence detection task at MediaEval 2014.

Run                   MAP@100   MAP2014
#1 (MFCC-DBN)         9.51 %    2.04 %
#2 (MFCC-BoAW-DBN)    16.51 %   2.64 %

REFERENCES

[1] B. Ionescu, K. Seyerlehner, C. Rasche, C. Vertan, P. Lambert. Video genre categorization and representation using audio-visual information. Journal of Electronic Imaging, 21(2), 2012.

[2] E. Kijak, L. Oisel, P. Gros. Temporal structure analysis of broadcast tennis video using hidden Markov models. In M. M. Yeung, R. Lienhart, C.-S. Li, editors, Storage and Retrieval for Media Databases, SPIE Proceedings 5021. SPIE, 2003.

[3] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, Computer Science Division, July 2002.

[4] K. P. Murphy. The Bayes net toolbox for Matlab. Computing Science and Statistics, 33, 2001.

[5] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.

[6] K. Tang. Learning latent temporal structure for complex event detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '12, Washington, DC, USA, 2012. IEEE Computer Society.