MTM at MediaEval 2014 Violence Detection Task

Bruno do Nascimento Teixeira
Universidade Federal de Minas Gerais
Belo Horizonte, Brazil
bruno.texeira@dcc.ufmg.br



ABSTRACT
This paper describes the participation of team MTM in the Violent Scenes Detection (VSD) task of the MediaEval 2014 campaign. We propose an approach to the problem of detecting violence that is based on probabilistic graphical models using Mel-frequency cepstral coefficients (MFCCs) as the audio feature. In our approach, we employ Dynamic Bayesian Networks (DBNs) to represent a violent scene as a dynamic system.

1. INTRODUCTION
The goal of the Violent Scenes Detection (VSD) task of the MediaEval 2014 benchmarking campaign is to detect violence in movies [5]. This year the organizers of the VSD task released two datasets: (i) a set of 31 Hollywood movies, where 24 are used for training and 7 for testing (our focus); (ii) a YouTube set, composed of 86 violent and non-violent videos. Violence is defined as content that "one would not let an 8-year-old child see in a movie because it contains physical violence". A model based on the variable-duration hidden Markov model has been proposed to detect complex events in Internet videos using latent variables [6]. The authors of [1] propose an audio-visual approach to video genre classification using content descriptors that exploit audio, color, temporal, and contour information, and they demonstrate good results over other existing approaches by combining these descriptors for genre classification. In [2], the temporal structure of broadcast tennis video is recovered with HMMs; the trained HMM is then used to analyze the temporal interleaving of shots.

We propose to model video based on temporal structure and the principle of causality, using Dynamic Bayesian Networks (DBNs).

2. METHOD
For this year's benchmark, we have developed an acoustic system based on temporal data (MFCC vectors). The main idea behind this approach is to represent a violent scene as a dynamic system.

2.1 Dynamic Bayesian Network
A DBN (see Figure 1) is a state-space model of a random variable V_t [3]:

    V_t = (U_t, X_t, Y_t),    (1)

where U_t represents the hidden variable, X_t the input, and Y_t the output. A pair (B_1, B_2) defines a DBN, where B_1 and B_2 are BNs. The two-slice temporal Bayes net B_2 (the DBN unrolled for 2 slices) defines P(V_t | V_{t-1}):

    P(V_t | V_{t-1}) = \prod_{i=1}^{N} P(V_t^i | Pa(V_t^i)),    (2)

where Pa(V_t^i) are the parents of V_t^i in the net. Next, our acoustic feature detector is described.

Figure 1: A graphical-model view of a DBN unrolled for 4 slices, with hidden state sequence U and an observed node X.
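To make the factorization in Eq. (2) concrete, the following minimal Python sketch (ours, not part of the submitted system) evaluates the transition probability for a toy DBN with two binary nodes per slice; the CPD tables and parent sets are illustrative assumptions.

import numpy as np

# Hypothetical CPDs: node 0 depends on its own previous value,
# node 1 depends on node 0 within the same slice.
cpd0 = np.array([[0.9, 0.1],   # cpd0[prev_v0, v0]
                 [0.3, 0.7]])
cpd1 = np.array([[0.8, 0.2],   # cpd1[v0, v1]
                 [0.4, 0.6]])

def transition_prob(v_prev, v):
    """Eq. (2): P(V_t = v | V_{t-1} = v_prev) as a product of per-node CPDs."""
    p0 = cpd0[v_prev[0], v[0]]   # Pa(V_t^0) = {V_{t-1}^0}
    p1 = cpd1[v[0], v[1]]        # Pa(V_t^1) = {V_t^0}
    return p0 * p1

print(transition_prob((0, 0), (0, 1)))   # 0.9 * 0.2 = 0.18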
2.2 Acoustic Feature Detector
Our audio concept detector is based on MFCCs. The audio signal is segmented into overlapping acoustic frames; each acoustic frame groups samples using a window of fixed length. We split the audio signal into frames of 40 ms length, with 20 ms overlap, and apply a Hamming window to each frame. The Hamming function is given by:

    w(n) = 0.54 - 0.46 \cos(2\pi n / (N - 1)).    (3)

For each audio frame, 12 MFCCs (range 133–6855 Hz) and their first and second derivatives are computed to build an acoustic vector y^j:

    y^j = (y_1^j, y_2^j, ..., y_{36}^j).    (4)

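A minimal sketch of this extraction pipeline, assuming the librosa library (the paper does not name its audio toolkit) and a hypothetical 16 kHz input file:

import numpy as np
import librosa

y, sr = librosa.load("movie_audio.wav", sr=16000)  # hypothetical file
frame = int(0.040 * sr)   # 40 ms window length in samples
hop = int(0.020 * sr)     # 20 ms hop, i.e. 20 ms overlap between frames

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=frame, hop_length=hop,
                            window="hamming", fmin=133.0, fmax=6855.0)
d1 = librosa.feature.delta(mfcc)            # first derivatives
d2 = librosa.feature.delta(mfcc, order=2)   # second derivatives
acoustic = np.vstack([mfcc, d1, d2]).T      # one 36-dim vector y^j per frame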
2.3 Bag of Audio Words Representation
After the feature extraction, one way of representing the audio is through a feature vector model using Bag of Audio Words (BoAW). In this representation, each vector has the size of the vocabulary, and each vocabulary word corresponds to one position in the vector. The i-th vector value for an audio segment equals the number of occurrences of word i in that segment.
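Our submitted runs use a vocabulary of 128 audio words (Section 3), but the vocabulary-learning step is not specified in the paper; k-means is the usual choice for bag-of-words models, so the sketch below assumes it. fit_vocabulary and boaw_histogram are hypothetical helper names.

import numpy as np
from sklearn.cluster import KMeans

def fit_vocabulary(train_vectors, vocab_size=128):
    """Learn the audio-word vocabulary from training acoustic vectors
    (assumed k-means; the paper does not specify the method)."""
    return KMeans(n_clusters=vocab_size, random_state=0).fit(train_vectors)

def boaw_histogram(segment_vectors, vocab):
    """Count how often each audio word occurs in one temporal segment."""
    words = vocab.predict(segment_vectors)   # nearest audio word per frame
    return np.bincount(words, minlength=vocab.n_clusters)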
Table 1: Performance of DBNs for the violence detection task at MediaEval 2014.

                                run #1 DBN                     run #2 DBN BoAW
Source                   MAP    MAP2014   MAP@100      MAP     MAP2014   MAP@100
8 MILE                 0.0000   0.0000    0.0000      0.0000   0.0000    0.0000
BRAVEHEART             0.0429   0.0029    0.0369      0.0572   0.0149    0.2977
DESPERADO              0.1875   0.0159    0.1407      0.2165   0.0173    0.1635
GHOST IN THE SHELL     0.1018   0.0125    0.0458      0.1401   0.0423    0.1970
JUMANJI                0.0480   0.0235    0.1000      0.0443   0.0180    0.0307
TERMINATOR 2           0.1974   0.0518    0.1993      0.1113   0.0133    0.0295
V FOR VENDETTA         0.1201   0.0364    0.1432      0.0985   0.0794    0.4311

MAP = Mean Average Precision; MAP2014 = Mean Average Precision 2014; MAP@100 = Mean Average Precision at 100.


Figure 2: Given a video, we split it into segments and build a BoAW histogram for each segment.

3. SUBMITTED RUNS
For each run, a naive DBN is trained using two different observed vectors Y_t: (i) the acoustic vector y^j, and (ii) the BoAW vector b_{y^j} with 128 audio words (see Figure 2). The likelihood of a model M, P(y_{1:T} | M), is used to assign a sequence y_{1:T} to the non-violent or violent label as follows:

    M^*(y_{1:T}) = \arg\max_M P(y_{1:T} | M) P(M).    (5)

The Bayes Net Toolbox for Matlab (BNT) [4] is used to train the dynamic networks.
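The paper trains its DBNs with BNT in Matlab; as a rough stand-in, the sketch below implements the decision rule of Eq. (5) with Gaussian HMMs (the simplest DBN) from the hmmlearn package. The model structure, hyperparameters, and helper names are our assumptions, not the authors' setup.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_model(sequences):
    """Fit one generative model per class on its training sequences."""
    X = np.vstack(sequences)               # frames stacked over all sequences
    lengths = [len(s) for s in sequences]  # sequence boundaries
    return GaussianHMM(n_components=3, covariance_type="diag",
                       n_iter=20, random_state=0).fit(X, lengths=lengths)

def classify(y_seq, models, log_priors):
    """Eq. (5): choose the label maximizing log P(y_{1:T}|M) + log P(M)."""
    scores = {label: m.score(y_seq) + log_priors[label]
              for label, m in models.items()}
    return max(scores, key=scores.get)

With models = {"violent": ..., "non-violent": ...} and, say, uniform log-priors, classify returns the most likely label for an observed sequence.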
Table 2: Global results for the violence detection task at MediaEval 2014.

Run                     MAP@100    MAP2014
#1 (MFCC-DBN)            9.51 %     2.04 %
#2 (MFCC-BoAW-DBN)      16.51 %     2.64 %

4. RESULTS AND DISCUSSION
Table 1 shows the Mean Average Precision (MAP), the MAP2014, and the MAP@100 for the test movies. The DBN with BoAW and the DBN without it perform similarly. Both approaches (run #1 and run #2) fail to detect the violent scenes in the movie "8 Mile". The run #2 results are higher for the movies "BRAVEHEART", "DESPERADO", "GHOST IN THE SHELL", and "V FOR VENDETTA", but lower for "TERMINATOR 2" and "JUMANJI", in comparison with run #1 (under both the MAP@100 and MAP2014 metrics). Run #2 uses the BoAW representation, which has fewer observations (temporal segments) than the run #1 approach, which directly uses the acoustic feature vectors built from MFCCs. Our best result is 16.51 % (MAP@100) or 2.64 % (MAP2014), obtained by run #2 (see Table 2). We investigated the results and came to the presumption that BoAW removes noisy observations while reducing the number of observations per segment. This may be related to the "grouping" of observations that occurs when the BoAW is computed over a temporal segment (see Figure 2). Thus, BoAW removes data noise and builds a better representation of a scene (the model observation). However, the results are still very poor. We suppose this could be due to the features: MFCCs alone seem unable to distinguish all violent from non-violent segments and to generalize the violence concept. Further work will aim at capturing the causality in violent segments using different structures and other feature modalities (feature selection).

5. ACKNOWLEDGMENTS
This work was supported in part by two grants from CAPES and CNPq.

6. REFERENCES
[1] B. Ionescu, K. Seyerlehner, C. Rasche, C. Vertan, and P. Lambert. Video genre categorization and representation using audio-visual information. Journal of Electronic Imaging, 21(2):023017-1–023017-17, 2012.
[2] E. Kijak, L. Oisel, and P. Gros. Temporal structure analysis of broadcast tennis video using hidden Markov models. In M. M. Yeung, R. Lienhart, and C.-S. Li, editors, Storage and Retrieval for Media Databases, volume 5021 of SPIE Proceedings, pages 289–299. SPIE, 2003.
[3] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, Computer Science Division, July 2002.
[4] K. P. Murphy. The Bayes Net Toolbox for Matlab. Computing Science and Statistics, 33, 2001.
[5] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, and C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[6] K. Tang. Learning latent temporal structure for complex event detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '12, pages 1250–1257, Washington, DC, USA, 2012. IEEE Computer Society.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain