CEUR Workshop Proceedings Vol-807: https://ceur-ws.org/Vol-807/glotin_DYNI-LSIS_Violence_me11wn.pdf (DBLP: https://dblp.org/rec/conf/mediaeval/GlotinRPP11)
Real-time Entropic Unsupervised Violent Scenes Detection
in Hollywood Movies - DYNI @ MediaEval Affect Task 2011

H. Glotin (a,b,e), J. Razik (a,b)   {glotin,razik}@univ-tln.fr
S. Paris (a,c)                      paris@lsis.org
J.-M. Prévot (d,b)                  jmp@univ-tln.fr

(a) Information Dynamics and Integration team (DYNI, LSIS)
(b) Univ. Sud-Toulon Var, 83957 La Garde Cedex, France
(c) Univ. Aix Marseille, 13397 Marseille Cedex 20, France
(d) Computer Science Department
(e) Institut universitaire de France (IUF)


ABSTRACT
State-of-the-art high-level feature detectors, such as violent scene detectors, are supervised systems. The aim of this work is to show that a simple unsupervised confidence function derived from straightforward features can perform well compared to current supervised systems on this kind of hard task. We therefore develop a violent event detector that is independent of the kind of movie, based on our previous research on simple but efficient entropic movie features. We propose an entropic audiovisual confidence computed as the average of the entropies of some simple visual and acoustic features. As a first approach, we develop our system for uniform false alarm and miss costs, which is not optimal with respect to the official campaign criterion. However, the usual F-measure metric indicates that our system is the second best among the five other (supervised) submitted systems.

Keywords
Entropy features, violent event detection, audiovisual detector, online system, unsupervised information retrieval

1. INTRODUCTION
State-of-the-art high-level feature detectors, including violent scene detectors, are supervised systems [1]. However, we assume that violent events exhibit a specific dynamics that should allow a weak detector to be computed on the fly. Therefore, we develop an online movie violent event detector that is independent of the kind of movie and needs no labeled training dataset. Based on our previous research on simple but efficient entropic high-level feature detection [2], we propose here an entropic audiovisual unsupervised confidence. It is based on the average of the entropies of some simple visual and acoustic features. This paper describes our best official submission (run 2); our run 1 was acoustic only, and was noised by the acoustic stream asynchrony.

As a first approach, we develop our system for uniform false alarm and miss costs, which is not optimal according to the official campaign criterion. However, according to the F-measure criterion, our system performs well among the five other (supervised) systems submitted to the official campaign. We therefore think that our feature should allow improvements when used in supervised systems.

2. ENTROPIC VISUAL CONFIDENCE
We assume that the dynamics of violent visual scenes is universal and may match fast orientation changes or other events. Thus, as in the state of the art [1], we use a line segment detector to extract orientation features. In [3], the fusion of two well-known line segment detectors defined a new, fast and efficient one. This operator [3] can be seen as a unified approach to statistical and structural texture analysis. An image can be represented as a single histogram computed by applying a multi-scale Local Binary Pattern [5,6] over the whole image. In very noisy images, a multi-scale approach [4] is needed to obtain a correct distribution. However, as a first approach, for each frame at time t we only consider the first scale, with 12 segment lengths λ1, ..., λ12 and 12 orientations θ1, ..., θ12 (one every π/12). We extract one frame each second. The extraction of this feature is then nearly twenty times faster than real time using our toolbox [6]. In a second step, for a visual frame at time t, let Xt be its discrete random variable with alphabet α = {(θi, λj), (i, j) ∈ [1, 12]²} and probability mass function p(xt) = Pr(Xt = xt), xt ∈ α. Considering two consecutive frames Xt, Xt+1, we then propose two kinds of visual confidence. First, we considered the shot average of the Kullback-Leibler distance dKL(p(xt), p(xt+1)), but this run was not submitted. Second, we set dxt = |p(xt) − p(xt+1)|, which is normalized to estimate a probability mass function p(dxt). We then compute its entropy Ht = −Σ_{dxt∈α} p(dxt) log p(dxt). Finally, for each shot S, the visual confidence γv(S) is set to the average of Hts over the frames ts ∈ S.
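The frame-level confidence above can be sketched in a few lines. The following is a hypothetical NumPy reimplementation, not the authors' toolbox [6]: it assumes the 12x12 orientation/length histogram of each frame is already available as a flat probability array, and computes the entropy Ht of the normalized absolute difference between consecutive frames, averaged over a shot.

```python
import numpy as np

def frame_entropy(p_t, p_t1):
    """Entropy H_t of the normalized absolute difference between two
    consecutive frame histograms (e.g. 12 orientations x 12 lengths
    = 144 bins). Hypothetical helper illustrating the paper's H_t."""
    d = np.abs(np.asarray(p_t, float) - np.asarray(p_t1, float))
    s = d.sum()
    if s == 0.0:                 # identical frames: no visual dynamics
        return 0.0
    q = d / s                    # normalize to a probability mass function
    q = q[q > 0]                 # 0 * log(0) = 0 by convention
    return float(-(q * np.log(q)).sum())

def visual_confidence(histograms):
    """Shot-level confidence gamma_v(S): average of H_t over the
    shot's consecutive frame pairs."""
    hs = [frame_entropy(histograms[i], histograms[i + 1])
          for i in range(len(histograms) - 1)]
    return float(np.mean(hs)) if hs else 0.0
```

For instance, a shot whose consecutive histograms differ uniformly over n bins yields H_t = log(n), while a static shot yields 0.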

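The first, unsubmitted visual confidence (the shot average of the Kullback-Leibler distance between consecutive frame histograms) can be sketched in the same spirit. This is again a hypothetical illustration rather than the authors' code; the smoothing constant `eps`, added to avoid log(0) on empty bins, is an assumption not stated in the paper.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) between two frame histograms, with additive
    smoothing so empty bins do not produce log(0)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p /= p.sum()
    q /= q.sum()
    return float((p * np.log(p / q)).sum())

def shot_kl_confidence(histograms):
    """Shot average of the KL distance between consecutive frames,
    the paper's first (unsubmitted) visual confidence variant."""
    ds = [kl_divergence(histograms[i], histograms[i + 1])
          for i in range(len(histograms) - 1)]
    return float(np.mean(ds)) if ds else 0.0
```

Unlike the entropy of the difference, this quantity is asymmetric in its two arguments, which may explain why the symmetric entropic variant was preferred for submission.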
3. ENTROPIC ACOUSTIC CONFIDENCE
The extracted audio track was delayed with respect to the visual track for some unknown reason. Nevertheless, we propose a simple entropic acoustic feature. First, we extract Mel Filter Cepstrum Coefficients (MFCC) using the SPro toolbox [7] (window length 20 ms). We extract their speed and acceleration, and we remove the energy coefficients, yielding 36 dimensions every 10 ms. Then, for each shot S, we set the acoustic confidence γa(S) to the complement of the average of the normalized entropy of each MFCC probability distribution.

4. AUDIOVISUAL CONFIDENCE
For each shot S, the audiovisual confidence is set to γ(S) = 4·γv(S) + γa(S). The overweighting of γv is due to the acoustic asynchrony, and to the lower discriminative power of the acoustic stream observed on the training set. We then threshold this final confidence so that twenty percent of the test set shots are labeled positive.

5. OFFICIAL RESULTS
We recall that we did not optimize our unsupervised system for the official weighted criterion, which overweights the miss cost by a factor of ten against the false alarm cost. Thus the system does not perform well according to this criterion. However, the official results indicate that, according to the F-measure, our system performs well (see Tab. 1 and the Precision-Recall curves in Fig. 1).

Table 1: Official results for the F-measure criterion of the best run of each team, at the shot level. All runs are supervised systems trained on nearly twenty hours of labeled movies, except ours. The official criterion is given in the "Weighted" column, which weights the miss cost by a factor of 10.

    RUN name         Fmeas.  Precis.  Recall  Weighted
    TECHNICOLOR,1    0.397   0.249    0.971    0.761
    DYNI LSIS,2      0.293   0.242    0.372    6.470
    UNIGE,4          0.289   0.178    0.774    2.838
    NII,6            0.245   0.140    1.000    1.000
    TUB,1            0.244   0.139    0.971    1.262
    LIG,1            0.197   0.179    0.223    7.940

Figure 1: Official Precision (X) vs. Recall (Y) curves of the best run of each of the six participants. The curve of our DYNI run is in red and is pointed to by the arrow.

6. DISCUSSION AND CONCLUSION
With their interface, the NII Lab [8] provided another analysis of this DYNI run: our top-50 list of confidences over the three test movies points to 14 relevant violent shots (these shots can be played on the NII interface pointed to from http://glotin.univ-tln.fr/mediaeval2011), which remains an interesting score.

Further work will consist first of a technical improvement: a better synchronization between the extracted audio and video tracks (nearly two seconds of delay is observed in our system between the visual and acoustic streams, due to the extraction system).

A second improvement will consist of developing a more accurate audiovisual fusion, which could easily be optimized on the training set. We shall also take into account the non-uniform weighted false alarm and miss costs of the official criterion.

Considering that our unsupervised system has the second best F-measure, while all the other runs are trained on 20 hours of labeled data, we think that our feature should allow improvements when used in supervised systems.

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy. Acknowledgment: We particularly thank Technicolor Rennes, UNIGE and IRISA TexMex for their organization of the violent scenes detection task. We thank the NII team for their visualization interface, provided for analysis after the official results.

7. REFERENCES
[1] Demarty C.-H., Penet C., Gravier G. and Soleymani M., The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies, MediaEval 2011 Workshop, Sept 2011, Pisa.
[2] Glotin H., Zhao Z.-Q. and Ayache S., Efficient Image Concept Indexing by Harmonic and Arithmetic Profiles, IEEE Int. Conf. on Image Processing (ICIP), Nov 2009.
[3] Grompone von Gioi R., Jakubowicz J., Morel J.-M. and Randall G., LSD: A Fast Line Segment Detector with a False Detection Control, IEEE Trans. PAMI, 19, Dec 2008.
[4] Paris S. and Glotin H., Pyramidal Multi-Level Features for the robotVision@ICPR 2010 Challenge, ICPR 2010.
[5] Paris S., Glotin H. and Zhao Z.-Q., Real-time Face Detection using Integral Histogram of Multi-Scale Local Binary Patterns, ICIC 2011.
[6] Paris S., Scenes/Objects Classification Toolbox, http://www.mathworks.com/matlabcentral/fileexchange/29800-scenesobjects-classification-toolbox
[7] Gravier G. et al., SPro: speech signal processing toolkit, INRIA project, https://gforge.inria.fr/projects/spro
[8] Vu L., Duy-Dinh L., Shinichi S. and Duc Anh D., NII, Japan at MediaEval 2011 Violent Scenes Detection Task, MediaEval 2011 Proc. Demo: http://satoh-lab.ex.nii.ac.jp/users/ledduy/Demo-MediaEval/