=Paper=
{{Paper
|id=None
|storemode=property
|title=Real-time entropic unsupervised violent scenes detection in Hollywood movies - DYNI @ MediaEval Affect Task 2011
|pdfUrl=https://ceur-ws.org/Vol-807/glotin_DYNI-LSIS_Violence_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/GlotinRPP11
}}
==Real-time entropic unsupervised violent scenes detection in Hollywood movies - DYNI @ MediaEval Affect Task 2011==
H. Glotin (a,b,e), J. Razik (a,b), S. Paris (a,c), J.-M. Prévot (d,b)
{glotin,razik}@univ-tln.fr, paris@lsis.org, jmp@univ-tln.fr
(a) Information Dynamics and Integration team (DYNI LSIS); (b) Univ. Sud-Toulon Var, 83957 La Garde Cedex, France; (c) Univ. Aix Marseille, 13397 Marseille Cedex 20, France; (d) Computer Science Department; (e) Institut universitaire de France (IUF)

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy. Acknowledgment: we particularly thank Technicolor Rennes, UNIGE and IRISA TexMex for their organization of the violent scenes detection task. We thank the NII team for the visualization interface they provided for analysis after the official results.

ABSTRACT
State-of-the-art high-level feature detectors, such as violent scene detectors, are supervised systems. The aim of this paper is to show that a simple unsupervised confidence function derived from straightforward features can perform well compared to current supervised systems on this kind of hard task. We develop a violent event detector that is independent of the kind of movie, based on our previous research on basic but efficient entropic movie features. We propose an entropic audiovisual confidence computed as the average of the entropies of some simple visual and acoustic features. In a first approach, we develop our system for uniform false alarm and missed detection costs, which is not optimal according to the official campaign criterion. However, the usual F-measure metric indicates that our system is the second best among the five other (supervised) submitted systems.

Keywords
Entropy features, violent event detection, audiovisual detector, online system, unsupervised information retrieval

1. INTRODUCTION
State-of-the-art high-level feature detectors are supervised systems, including violent scene detectors [1]. However, we assume that violent events exhibit specific dynamics that allow a weak detector to be computed on the fly. We therefore develop an online movie violent event detector that is independent of the kind of movie and needs no labeled training dataset. Based on our previous research on basic but efficient entropic high-level feature detection [2], we propose here an entropic audiovisual unsupervised confidence. It is based on the average of the entropies of some simple visual and acoustic features. This paper describes our best official run (run 2); our run 1 was acoustic only and was degraded by the asynchrony of the acoustic stream. In a first approach we develop our system for uniform false alarm and missed detection costs, which is not optimal according to the official campaign criterion. However, according to the F-measure criterion, our system performs well among the five other (supervised) systems submitted to the official campaign. We therefore expect that our feature will allow improvements when used in supervised systems.
2. ENTROPIC VISUAL CONFIDENCE
We assume that the dynamics of violent visual scenes is universal and may correspond to fast orientation changes or similar events. Thus, as in the state of the art [1], we use a line segment detector to extract orientation features. In [3], the fusion of two well-known line segment detectors defined a new, fast and efficient one. This operator [3] can be seen as a unified approach to statistical and structural texture analysis. An image can be represented as a single histogram computed by applying a multi-scale Local Binary Pattern [5,6] over the whole image. In very noisy images, a multi-scale approach [4] is needed to obtain a correct distribution. However, as a first approach, for each frame at time t we only consider the first scale, with 12 segment lengths λ1, ..., λ12 and 12 orientations θ1, ..., θ12 (one every π/12). We extract one frame per second. The extraction of this feature is then nearly twenty times faster than real time using our toolbox [6]. In a second step, let Xt denote, for the visual frame at time t, a discrete random variable with alphabet α = {(θi, λj), (i, j) ∈ [1, 12]²} and probability mass function p(xt) = Pr(Xt = xt), xt ∈ α. Considering two consecutive variables Xt, Xt+1, we propose two kinds of visual confidence. First, we considered the shot average of the Kullback-Leibler divergence dKL(p(xt), p(xt+1)), but this run was not submitted. Second, we set dxt = |p(xt) − p(xt+1)|, which is normalized to estimate a probability mass function p(dxt). We then compute its entropy Ht = −Σ_{dxt ∈ α} p(dxt) log(p(dxt)). Finally, for each shot S, the visual confidence γv(S) is set to the average of Hts over the frames ts ∈ S.
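For concreteness, the per-frame entropy computation of Section 2 can be sketched in Python. This is a minimal sketch under stated assumptions: each frame is assumed to be already summarized by a normalized 144-bin histogram (12 orientations × 12 segment lengths); the line-segment extraction of [3, 6] is not reproduced, and the function name `visual_confidence` is ours, not the authors'.

```python
import numpy as np

def visual_confidence(frame_hists, eps=1e-12):
    """Shot-level entropic visual confidence (sketch).

    frame_hists: array of shape (n_frames, 144), one (theta, lambda)
    histogram per frame, each normalized to a probability mass function.
    """
    entropies = []
    for t in range(len(frame_hists) - 1):
        # Absolute difference between consecutive frame distributions ...
        dx = np.abs(frame_hists[t + 1] - frame_hists[t])
        # ... normalized to estimate a probability mass function p(dx_t)
        p = dx / max(dx.sum(), eps)
        # Entropy H_t = -sum p log p (0 log 0 treated as 0)
        h = -np.sum(p * np.log(p + eps))
        entropies.append(h)
    # gamma_v(S): average entropy over the frames of the shot
    return float(np.mean(entropies)) if entropies else 0.0
```

Identical consecutive frames give a zero difference distribution and hence zero confidence, so only shots with changing orientation statistics score high.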
3. ENTROPIC ACOUSTIC CONFIDENCE
The extracted audio track was delayed with respect to the visual track for some unknown reason. Nevertheless, we propose a simple entropic acoustic feature. First we extract Mel Frequency Cepstral Coefficients (MFCC) using the SPro toolbox [7] (window length 20 ms). We extract their speed and acceleration, and we remove the energy coefficients, yielding 36 dimensions every 10 ms. Then, for each shot S, we set the acoustic confidence γa(S) to the complement of the average of the normalized entropies of the MFCC probability distributions.
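The acoustic confidence can be sketched as follows, assuming the 36-dimensional MFCC derivative vectors of one shot are already extracted (the SPro extraction itself is not reproduced). The per-coefficient histogram estimate of the "MFCC probability distribution" and the 32-bin count are our assumptions, and `acoustic_confidence` is a hypothetical name.

```python
import numpy as np

def acoustic_confidence(mfcc, n_bins=32, eps=1e-12):
    """Shot-level entropic acoustic confidence (sketch).

    mfcc: array of shape (n_frames, 36) holding the speed and
    acceleration MFCC coefficients of one shot (energy removed),
    one 36-dimensional vector every 10 ms.
    """
    norm_entropies = []
    max_h = np.log(n_bins)  # entropy of the uniform distribution
    for coeff in mfcc.T:
        # Estimate the probability distribution of this coefficient
        counts, _ = np.histogram(coeff, bins=n_bins)
        p = counts / max(counts.sum(), 1)
        # Normalized entropy in [0, 1]
        h = -np.sum(p * np.log(p + eps)) / max_h
        norm_entropies.append(h)
    # gamma_a(S): complement of the average normalized entropy
    return 1.0 - float(np.mean(norm_entropies))
```

With this convention, a flat (low-entropy) acoustic shot scores close to 1 and a noisy (high-entropy) shot scores lower.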
4. AUDIOVISUAL CONFIDENCE
For each shot S, the audiovisual confidence is set to γ(S) = 4 γv(S) + γa(S). The overweighting of γv is due to the acoustic asynchrony and to the lower discriminative power of the acoustic feature observed on the training set. We then threshold this final confidence so that twenty percent of the test set shots are labeled positive.
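The fusion and the twenty-percent thresholding above can be sketched as follows; the rank-based threshold and the function name `detect_violent_shots` are our illustration, not the authors' code.

```python
import numpy as np

def detect_violent_shots(gamma_v, gamma_a, positive_rate=0.20):
    """Fuse visual and acoustic confidences and threshold them (sketch).

    gamma_v, gamma_a: per-shot confidence arrays of equal length.
    Returns a boolean array flagging the top `positive_rate`
    fraction of shots as violent.
    """
    gamma_v = np.asarray(gamma_v, dtype=float)
    gamma_a = np.asarray(gamma_a, dtype=float)
    # gamma(S) = 4 * gamma_v(S) + gamma_a(S): the visual stream is
    # overweighted because of the acoustic asynchrony
    gamma = 4.0 * gamma_v + gamma_a
    # Threshold so that `positive_rate` of the shots are positive
    n_pos = max(1, int(round(positive_rate * len(gamma))))
    threshold = np.sort(gamma)[-n_pos]
    return gamma >= threshold
```

Fixing the positive rate rather than the threshold value keeps the decision rule unsupervised: no labeled data is needed to calibrate it.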
5. OFFICIAL RESULTS
We recall that we did not optimize our unsupervised system according to the official weighted criterion, which weights the missed detection cost ten times higher than the false alarm cost. The system is therefore not performing well according to this criterion. However, the official results indicate that, according to the F-measure, our system performs well (see Tab. 1 and the Precision-Recall curve in Fig. 1).

Table 1: Official results (F-measure criterion) of the best run of each team, at the shot level. All runs are supervised systems trained on nearly twenty hours of labeled movies, except ours. The official criterion is given in the column "Weighted", which weights the missing cost by a factor 10.

 RUN name        Fmeas.  Precis.  Recall  Weighted
 TECHNICOLOR,1   0.397   0.249    0.971   0.761
 DYNI LSIS,2     0.293   0.242    0.372   6.470
 UNIGE,4         0.289   0.178    0.774   2.838
 NII,6           0.245   0.140    1.000   1.000
 TUB,1           0.244   0.139    0.971   1.262
 LIG,1           0.197   0.179    0.223   7.940

Figure 1: Official Precision (X) vs. Recall (Y) curves of the best run of each of the six participants. The curve of our DYNI run is in red and is pointed to by the arrow.

6. DISCUSSION AND CONCLUSION
The NII Lab [8] provided with their interface another analysis of this DYNI run: our 50 top-confidence shots over the three test movies point to 14 relevant violent shots (1), which remains an interesting score. Further work will first consist of a technical improvement: a better synchronization between the extracted audio and video tracks (nearly two seconds of delay is observed in our system between the visual and acoustic streams due to the extraction system). A second improvement will consist of developing a more accurate audiovisual fusion, which can easily be optimized on the training set. We shall also take into account the non-uniform weighted false alarm and missed detection costs of the official criterion. Considering that our unsupervised system obtains the second best F-measure, and that all the other runs are trained on 20 hours of labeled data, we think that our feature should allow improvements when used in supervised systems.

(1) These shots can be played on the NII interface linked from http://glotin.univ-tln.fr/mediaeval2011.

7. REFERENCES
[1] Demarty C.-H., Penet C., Gravier G. and Soleymani M., The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies, MediaEval 2011 Workshop, Sept 2011, Pisa.
[2] Glotin H., Zhao Z.Q. and Ayache S., Efficient Image Concept Indexing by Harmonic and Arithmetic Profiles, IEEE Int. Conf. on Image Processing (ICIP), Nov 2009.
[3] Grompone von Gioi R., Jakubowicz J., Morel J.-M. and Randall G., LSD: A Fast Line Segment Detector with a False Detection Control, IEEE Trans. PAMI, 19, Dec 2008.
[4] Paris S. and Glotin H., Pyramidal Multi-Level Features for the RobotVision@ICPR 2010 Challenge, ICPR 2010.
[5] Paris S., Glotin H. and Zhao Z.Q., Real-time Face Detection using Integral Histogram of Multi-Scale Local Binary Patterns, ICIC 2011.
[6] Paris S., Scenes/Objects Classification Toolbox, http://www.mathworks.com/matlabcentral/fileexchange/29800-scenesobjects-classification-toolbox
[7] Gravier G. et al., SPro: speech signal processing toolkit, INRIA project, https://gforge.inria.fr/projects/spro
[8] Vu L., Duy-Dinh L., Shinichi S. and Duc Anh D., NII, Japan at MediaEval 2011 Violent Scenes Detection Task, MediaEval 2011 Proc. Demo: http://satoh-lab.ex.nii.ac.jp/users/ledduy/Demo-MediaEval/