LIG at MediaEval 2011 Affect Task: Use of a Generic Method

Bahjat Safadi                                Georges Quénot
UJF-Grenoble 1 / UPMF-Grenoble 2 / Grenoble INP / CNRS, LIG UMR 5217,
Grenoble, F-38041, France
Bahjat.Safadi@imag.fr                        Georges.Quenot@imag.fr

ABSTRACT
This paper describes the LIG participation in the MediaEval 2011 Affect Task on violent scenes detection in Hollywood movies. We submitted only the required run (the shot classification run), with a minimal system using only the visual information. Color, texture and SIFT descriptors were extracted from key frames. The performance of our system was below that of the systems using both audio and visual information, but it appeared quite good in precision.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Algorithms, Experimentation

Keywords
Violence detection, Affect, Video Annotation, Benchmark

1. INTRODUCTION
The MediaEval 2011 Affect Task: Violent Scenes Detection is fully described in [1]. It directly derives from a Technicolor use case which aims at easing a user's selection process from a movie database. This task therefore applies to movie content.

Our motivation was to see how a generic system for general concept classification in video shots would perform compared to systems specifically designed for the task, such as [4]. Our system is roughly a four-stage pipeline: descriptor extraction, descriptor optimization, classification and fusion. Most of the stages have been optimized for the TRECVID 2011 semantic indexing task [3][2], but some parameters have been specifically tuned on the MediaEval development data.

2. SYSTEM DESCRIPTION

2.1 Descriptor extraction
The descriptors were computed only on the visual information (no audio) and only on the key frames (no motion). Three types of descriptors were used:

• color: a 4 × 4 × 4 RGB color histogram (64-dim);

• texture: a 5-scale × 8-orientation Gabor transform (40-dim);

• SIFT: bags of SIFT descriptors computed using Koen van de Sande's software [5], as 1000-bin histograms; four variants were used: Harris-Laplace filtering or dense sampling, with hard or fuzzy clustering.

2.2 Descriptor optimization
The descriptor optimization consists of two steps:

• power transformation: its goal is to normalize the distributions of the values, especially in the case of histogram components. It simply consists in applying an x ← x^α transformation to all components individually. The optimal value of α can be found by cross-validation and is often close to 0.5 for histogram-based descriptors.

• PCA reduction: its goal is both to reduce the size (number of dimensions) of the descriptors and to improve performance by removing noisy components. For color and texture, the optimal number of dimensions is close to half of the original one. For the SIFT-based descriptors, it is in the 150-250 range.

2.3 Classification
The classification was done here using a kNN-based classifier. It is slightly less effective than an SVM-based one, but it is much faster.

2.4 Fusion
Classification was done separately, with one kNN classifier for each descriptor variant. The outputs of these individual classifiers are then merged at the level of normalized scores (late fusion). A linear combination of the scores is used, with weights optimized on the MediaEval development set. It finally appeared that, for the MediaEval task, the SIFT descriptors did not help compared to color and texture alone; this was not the case in the general context of TRECVID.

3. EXPERIMENTAL RESULTS
Figure 1 shows the false alarm rate versus miss rate for the participants' best runs. It is obtained by applying a varying threshold to the scores provided by the participants.
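Such a miss-rate/false-alarm curve can be sketched as follows, with a hypothetical det_points helper; the scores and labels below are made-up toy values, not actual participant outputs:

```python
# Illustrative sketch: tracing miss rate vs. false alarm rate by
# sweeping a decision threshold over per-shot violence scores.

def det_points(scores, labels):
    """For each candidate threshold, return (false_alarm_rate, miss_rate)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores)):
        # A shot is flagged violent when its score is >= the threshold.
        misses = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t)
        false_alarms = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t)
        points.append((false_alarms / negatives, misses / positives))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # toy per-shot scores
labels = [1, 1, 0, 1, 0, 0]               # toy ground truth (1 = violent)
for fa, miss in det_points(scores, labels):
    print(f"FA rate {fa:.2f}  miss rate {miss:.2f}")
```

Raising the threshold moves a run along its curve towards fewer false alarms and more misses, which is how a single submitted score list yields the whole trade-off curve.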
The LIG system performs less well than the systems using both audio and visual information. However, it appears to be as good as all of them in the area of low false alarm rates. This means that the LIG system is able to find, with good confidence, a fraction of the shots containing physical violence; beyond these, it fails to detect the others, probably because the audio and/or motion modalities are necessary for them.

Figure 1: False alarm rate versus miss rate for the participants' best runs

                         F-measure   MediaEval cost
  Kill Bill                 0.19          8.58
  The Bourne Identity       0.24          6.07
  The Wizard of Oz          0.00         10.1
  All                       0.20          7.94

Table 1: Performance of the LIG system

Table 1 shows the performance of the LIG system using the AED F-measure (common in information retrieval) and the official MediaEval cost. The MediaEval cost is highly biased towards recall; while the threshold of our system was also biased in this direction, it was not biased enough to be optimal for this measure.

While the performance of the system is consistent on Kill Bill and The Bourne Identity, it is very bad for The Wizard of Oz. The system did not find any of the 46 violent shots, though it predicted 60 positives (all false) out of a total of 908 shots. This seems to be worse than random.

4. CONCLUSIONS AND FUTURE WORK
We have participated in the MediaEval 2011 Affect Task with a basic system designed for general-purpose concept detection in video shots. This system used only the information available in the key frames (no audio or motion). It was initially intended to be used as a baseline, and specific extensions were considered, but they could not be finalized in time. Also, concerning the target measure, the threshold was biased a bit towards recall, but not enough for an optimal result with the same ranking.

In our future work, we plan to improve this baseline system by using a better classifier (SVM-based) and by including motion descriptors based on optical flow and audio descriptors based on MFCC.

5. ACKNOWLEDGMENTS
This work was partly realized as part of the Quaero Program funded by OSEO, the French State agency for innovation.

6. REFERENCES
[1] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2 2011.
[2] B. Safadi, N. Derbas, A. Hamadi, F. Thollard, and G. Quénot. LIG at TRECVID 2011. In Proc. TRECVID Workshop, Gaithersburg, MD, USA, December 5-7 2011.
[3] A. F. Smeaton, P. Over, and W. Kraaij. High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements. In A. Divakaran, editor, Multimedia Content Analysis, Theory and Applications, pages 151-174. Springer Verlag, Berlin, 2009.
[4] F. D. M. d. Souza, G. C. Chavez, E. A. d. Valle Jr., and A. d. A. Araujo. Violence detection in video using spatio-temporal features. In Proceedings of the 2010 23rd SIBGRAPI Conference on Graphics, Patterns and Images, pages 224-230, Washington, DC, USA, 2010.
[5] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582-1596, 2010.

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
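As an illustration of the descriptor optimization described in Section 2.2, the following minimal sketch applies a power transformation followed by a PCA reduction; the α value, the target dimension, and the random histograms are arbitrary example choices, not the values tuned by the authors:

```python
# Sketch of the two-step descriptor optimization of Section 2.2:
# power transformation, then PCA reduction.
import numpy as np

def power_transform(X, alpha=0.5):
    # Normalize component distributions: x <- x^alpha, element-wise.
    return np.power(X, alpha)

def pca_reduce(X, n_components):
    # Center the data, then project onto the top principal axes
    # (computed via SVD), discarding the noisier trailing components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
hist = rng.random((100, 64))   # stand-in for 64-dim color histograms
reduced = pca_reduce(power_transform(hist), n_components=32)
print(reduced.shape)           # (100, 32): about half the original size
```

Halving the dimensionality matches the paper's observation for color and texture; for the 1000-bin SIFT histograms, n_components would instead fall in the 150-250 range.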