LIG at MediaEval 2011 Affect Task: Use of a Generic Method

Bahjat Safadi                                Georges Quénot
UJF-Grenoble 1 / UPMF-Grenoble 2 / Grenoble INP / CNRS, LIG UMR 5217,
Grenoble, F-38041, France
Bahjat.Safadi@imag.fr                        Georges.Quenot@imag.fr

ABSTRACT
This paper describes the LIG participation in the MediaEval 2011 Affect Task on violent scenes detection in Hollywood movies. We submitted only the required run (the shot classification run), with a minimal system using only the visual information. Color, texture and SIFT descriptors were extracted from key frames. The performance of our system was below that of the systems using both audio and visual information, but it appeared quite good in precision.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Algorithms, Experimentation

Keywords
Violence detection, Affect, Video Annotation, Benchmark

1. INTRODUCTION
The MediaEval 2011 Affect Task: Violent Scenes Detection is fully described in [1]. It directly derives from a Technicolor use case which aims at easing a user's selection process from a movie database. This task therefore applies to movie content.

Our motivation was to see how a generic system for general concept classification in video shots would perform compared to systems specifically designed for the task, such as [4]. Our system is roughly a four-stage pipeline: descriptor extraction, descriptor optimization, classification and fusion. Most of the stages have been optimized for the TRECVID 2011 semantic indexing task [3][2], but some parameters have been specifically tuned on the MediaEval development data.

2. SYSTEM DESCRIPTION

2.1 Descriptor extraction
The descriptors were computed only on the visual information (no audio) and only on the key frames (no motion). Three types of descriptors were used:

• color: a 4 × 4 × 4 RGB color histogram (64-dim);

• texture: a 5-scale × 8-orientation Gabor transform (40-dim);

• SIFT: bags of SIFT descriptors computed using Koen van de Sande's software [5], as 1000-bin histograms; four variants were used: Harris-Laplace filtering or dense sampling, with hard or fuzzy clustering.

2.2 Descriptor optimization
The descriptor optimization consists of two steps:

• power transformation: its goal is to normalize the distributions of the values, especially in the case of histogram components. It simply consists in applying an x ← x^α transformation to all components individually. The optimal value of α can be found by cross-validation and is often close to 0.5 for histogram-based descriptors.

• PCA reduction: its goal is both to reduce the size (number of dimensions) of the descriptors and to improve performance by removing noisy components. For color and texture, the optimal number of dimensions is close to half of the original one. For the SIFT-based descriptors, it is in the 150-250 range.

2.3 Classification
The classification was done here using a kNN-based classifier. It is slightly less effective than an SVM-based one, but it is much faster.

2.4 Fusion
Classification was done separately, with one kNN classifier for each descriptor variant. The outputs of these individual classifiers are then merged at the level of normalized scores (late fusion). A linear combination of the scores is used, with weights optimized on the MediaEval development set. It finally appeared that, for the MediaEval task, the SIFT descriptors did not help compared to color and texture alone; this was not the case in the general context of TRECVID.

3. EXPERIMENTAL RESULTS
Figure 1 shows the false alarm rate versus miss rate for the participants' best runs. It is obtained by applying a varying threshold to the scores provided by the participants.
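Such a miss-rate/false-alarm curve can be sketched as follows, with a hypothetical det_points helper; the scores and labels below are made-up toy values, not actual participant outputs:

```python
# Illustrative sketch: tracing miss rate vs. false alarm rate by
# sweeping a decision threshold over per-shot violence scores.

def det_points(scores, labels):
    """For each candidate threshold, return (false_alarm_rate, miss_rate)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores)):
        # A shot is flagged violent when its score is >= the threshold.
        misses = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t)
        false_alarms = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t)
        points.append((false_alarms / negatives, misses / positives))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # toy per-shot scores
labels = [1, 1, 0, 1, 0, 0]               # toy ground truth (1 = violent)
for fa, miss in det_points(scores, labels):
    print(f"FA rate {fa:.2f}  miss rate {miss:.2f}")
```

Raising the threshold moves a run along its curve towards fewer false alarms and more misses, which is how a single submitted score list yields the whole trade-off curve.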
The LIG system performs less well than the systems using both audio and visual information. However, it appears to be as good as all of them in the area of low false alarm rates. This means that the LIG system is able to find, with good confidence, a fraction of the shots containing physical violence; beyond these, it fails to detect the others, probably because the audio and/or motion modalities are necessary for them.

Figure 1: False alarm rate versus miss rate for the participants' best runs

                         F-measure   MediaEval cost
  Kill Bill                 0.19          8.58
  The Bourne Identity       0.24          6.07
  The Wizard of Oz          0.00         10.1
  All                       0.20          7.94

Table 1: Performance of the LIG system

Table 1 shows the performance of the LIG system using the AED F-measure (common in information retrieval) and the official MediaEval cost. The MediaEval cost is highly biased towards recall; while the threshold of our system was also biased in this direction, it was not biased enough to be optimal for this measure.

While the performance of the system is consistent on Kill Bill and The Bourne Identity, it is very bad for The Wizard of Oz. The system did not find any of the 46 violent shots, though it predicted 60 positives (all false) out of a total of 908 shots. This seems to be worse than random.

4. CONCLUSIONS AND FUTURE WORK
We have participated in the MediaEval 2011 Affect Task with a basic system designed for general-purpose concept detection in video shots. This system used only the information available in the key frames (no audio or motion). It was initially intended to be used as a baseline, and specific extensions were considered, but they could not be finalized in time. Also, concerning the target measure, the threshold was biased a bit towards recall, but not enough for an optimal result with the same ranking.

In our future work, we plan to improve this baseline system by using a better classifier (SVM-based) and by including motion descriptors based on optical flow and audio descriptors based on MFCC.

5. ACKNOWLEDGMENTS
This work was partly realized as part of the Quaero Program funded by OSEO, the French State agency for innovation.

6. REFERENCES
[1] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2 2011.
[2] B. Safadi, N. Derbas, A. Hamadi, F. Thollard, and G. Quénot. LIG at TRECVID 2011. In Proc. TRECVID Workshop, Gaithersburg, MD, USA, December 5-7 2011.
[3] A. F. Smeaton, P. Over, and W. Kraaij. High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements. In A. Divakaran, editor, Multimedia Content Analysis, Theory and Applications, pages 151-174. Springer Verlag, Berlin, 2009.
[4] F. D. M. d. Souza, G. C. Chavez, E. A. d. Valle Jr., and A. d. A. Araujo. Violence detection in video using spatio-temporal features. In Proceedings of the 2010 23rd SIBGRAPI Conference on Graphics, Patterns and Images, pages 224-230, Washington, DC, USA, 2010.
[5] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582-1596, 2010.

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
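As an illustration of the descriptor optimization described in Section 2.2, the following minimal sketch applies a power transformation followed by a PCA reduction; the α value, the target dimension, and the random histograms are arbitrary example choices, not the values tuned by the authors:

```python
# Sketch of the two-step descriptor optimization of Section 2.2:
# power transformation, then PCA reduction.
import numpy as np

def power_transform(X, alpha=0.5):
    # Normalize component distributions: x <- x^alpha, element-wise.
    return np.power(X, alpha)

def pca_reduce(X, n_components):
    # Center the data, then project onto the top principal axes
    # (computed via SVD), discarding the noisier trailing components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
hist = rng.random((100, 64))   # stand-in for 64-dim color histograms
reduced = pca_reduce(power_transform(hist), n_components=32)
print(reduced.shape)           # (100, 32): about half the original size
```

Halving the dimensionality matches the paper's observation for color and texture; for the 1000-bin SIFT histograms, n_components would instead fall in the 150-250 range.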