RECOD at MediaEval 2014: Violent Scenes Detection Task

Sandra Avila‡, Daniel Moreira†, Mauricio Perez†, Daniel Moraes§, Isabela Cota†, Vanessa Testoni§, Eduardo Valle‡, Siome Goldenstein†, Anderson Rocha†
† Institute of Computing, University of Campinas (Unicamp), SP, Brazil
‡ School of Electrical and Computing Engineering, University of Campinas (Unicamp), SP, Brazil
§ Samsung Research Institute Brazil, SP, Brazil
sandra@dca.fee.unicamp.br, daniel.moreira@ic.unicamp.br, {mauricio.perez, daniel.moraes, isabela.cota}@students.ic.unicamp.br, vanessa.t@samsung.com, dovalle@dca.fee.unicamp.br, {siome, anderson.rocha}@ic.unicamp.br

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT
This paper presents the RECOD approaches used in the MediaEval 2014 Violent Scenes Detection task. Our system is based on the combination of visual, audio, and text features. We also evaluate the performance of a convolutional network as a feature extractor. We combined those features using a fusion scheme. We participated in the main and the generalization tasks.

1. INTRODUCTION
The objective of the MediaEval 2014 Violent Scenes Detection task is to automatically detect violent scenes in movies and web videos. The targeted violent scenes are those "one would not let an 8 years old child see in a movie because they contain physical violence".
This year, two different datasets were proposed: (i) a set of 31 Hollywood movies, for the main task, and (ii) a set of 86 short YouTube web videos, for the generalization task. The training data is the same for both subtasks. A detailed overview of the datasets and the subtasks can be found in [6].
In the following, we briefly introduce our system and discuss our results¹.

¹ There are some technical aspects which we cannot put directly in the manuscript, given that we are patenting the developed approach.

2. SYSTEM DESCRIPTION

2.1 Visual Features
In low-level visual feature extraction, we extract SURF descriptors [4]. For that, we first apply the FFmpeg software [1] to extract and resize the video frames. Low-level visual descriptors are extracted on a dense spatial grid at multiple scales. Next, they are reduced using a PCA algorithm.
Besides that, in order to incorporate temporal information, we compute dense trajectories and motion boundary descriptors, according to [7]. Again, for the sake of processing time, we decided to resize the video. Also, we reduce the dimensionality of the video descriptors.
In mid-level feature extraction, for each descriptor type, we use a bag of visual words-based representation.
Furthermore, we use a visual feature extractor based on Convolutional Networks, which were trained on the ImageNet 2012 training set [5]. It has been chosen due to its very competitive results on detection and classification tasks. Additionally, as far as we know, deep learning methods have not yet been employed in the MediaEval Violent Scenes Detection task.

2.2 Audio Features
Using the OpenSmile library [3], we extract several types of audio features. A bag of visual words-based representation is employed to quantize the audio features, and a PCA algorithm is also used to reduce the dimensionality of the features.

2.3 Text Features
To represent the movie subtitles, we apply the bag of words approach: the most common, simple, and successful document representation used so far. The bag of words vector is normalized using each term's document frequency.
Also, before creating the bag of words representation, we remove the stop words and we apply a stemming algorithm to reduce each word to its stem.
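To make the mid-level step shared by the visual (Section 2.1) and audio (Section 2.2) descriptors concrete, the sketch below shows one possible instantiation of the PCA reduction followed by bag-of-words encoding, using off-the-shelf scikit-learn components. It is only an illustration: the descriptor dimensionality, codebook size, and clustering method are placeholder choices and do not reproduce our exact pipeline, whose details we cannot disclose (cf. footnote 1).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

def learn_mid_level_model(train_descriptors, n_components=64, codebook_size=1024):
    """Fit PCA and a k-means codebook on low-level descriptors
    pooled from the training videos (sizes here are hypothetical)."""
    pca = PCA(n_components=n_components).fit(train_descriptors)
    reduced = pca.transform(train_descriptors)
    codebook = MiniBatchKMeans(n_clusters=codebook_size, random_state=0).fit(reduced)
    return pca, codebook

def encode_segment(segment_descriptors, pca, codebook):
    """Encode one video segment as an L1-normalized bag-of-words histogram."""
    words = codebook.predict(pca.transform(segment_descriptors))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Usage (train_descriptors: an (N, D) array of SURF, dense-trajectory, or
# audio descriptors; one call to encode_segment per video segment):
# pca, codebook = learn_mid_level_model(train_descriptors)
# feature_vector = encode_segment(segment_descriptors, pca, codebook)
```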
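In the same spirit, the subtitle representation of Section 2.3 can be sketched with standard tools: stop-word removal, stemming, and a bag of words weighted by document frequency. The English stop-word list, the Porter stemmer, and the TF-IDF weighting below are illustrative choices, not necessarily the ones used in our runs.

```python
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def analyze(text):
    """Lowercase, tokenize, drop stop words, and reduce words to their stems."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# The TF-IDF weighting plays the role of the document-frequency
# normalization mentioned above.
vectorizer = TfidfVectorizer(analyzer=analyze)

# Usage (one "document" per video segment, built from its subtitle lines):
# X = vectorizer.fit_transform(segment_subtitles)  # sparse (n_segments, |vocab|)
```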
2.4 Classification
Classification is performed with Support Vector Machine (SVM) classifiers, using the LIBSVM library [2]. Moreover, classification is done separately for each descriptor. The outputs of those individual classifiers are then combined at the level of normalized scores. Our fusion strategy consists of a combination of the classification outcomes, optimized on the training set.
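The sketch below illustrates this late-fusion scheme with scikit-learn's SVC, which is built on LIBSVM: one classifier per descriptor type, min-max normalization of the scores, and a weighted sum. The RBF kernel and the equal default weights are placeholders; in our system the combination weights are optimized on the training set and are not detailed here (cf. footnote 1).

```python
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

def train_per_descriptor(features_by_type, labels):
    """Train one SVM per descriptor type (e.g. 'surf', 'trajectories',
    'cnn', 'audio', 'text'); labels are 1 for violent, 0 otherwise."""
    return {name: SVC(kernel="rbf", probability=True).fit(X, labels)
            for name, X in features_by_type.items()}

def fuse_scores(classifiers, test_features_by_type, weights=None):
    """Combine normalized per-descriptor scores with a weighted sum."""
    names = sorted(classifiers)
    if weights is None:  # placeholder: equal weights
        weights = {name: 1.0 / len(names) for name in names}
    fused = 0.0
    for name in names:
        scores = classifiers[name].predict_proba(test_features_by_type[name])[:, 1]
        scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
        fused = fused + weights[name] * scores
    return fused

# A segment is then flagged as violent when its fused score exceeds a
# threshold chosen on the training data (cf. footnote 2).
```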
3. RUNS SUBMITTED
In total, we generated 10 different runs: 5 runs for each subtask. For the main task (m), we have:

• m1: 3 types of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks) + text features;
• m2: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks) + text features;
• m3: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);
• m4: 1 type of audio features + 2 types of visual features + text features;
• m5: 1 type of audio features.

For the generalization task (g), we have:

• g1: 3 types of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);
• g2: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);
• g3: 1 type of audio features + 2 types of visual features;
• g4: 1 type of audio features;
• g5: 1 type of visual features.

4. RESULTS AND DISCUSSION
Tables 1 and 2 show the performance of our system for the main and the generalization tasks, respectively. We can notice that, despite the diversity of fusion strategies, the differences among most runs (m1, m2, m3 and g1, g2, g3) are quite small. We are currently investigating such results. Also, we observe that, for run m4, we selected a wrong threshold² by mistake.

² Scenes are classified as violent or non-violent based on a certain threshold.

     8mil   Brav   Desp   Ghos   Juma   Term   Vven   MAP
m1   0.204  0.477  0.337  0.567  0.188  0.479  0.378  0.376
m2   0.239  0.459  0.308  0.348  0.362  0.465  0.431  0.373
m3   0.189  0.545  0.277  0.465  0.212  0.418  0.489  0.371
m4   0.115  0.319  0.209  0.270  0.159  0.502  0.167  0.249
m5   0.373  0.301  0.307  0.423  0.175  0.308  0.317  0.315

Table 1: Official results obtained for the main task in terms of MAP2014 (per test movie and overall).

      g1     g2     g3     g4     g5
MAP   0.618  0.615  0.604  0.545  0.515

Table 2: Official results obtained for the generalization task in terms of MAP2014.

For the main task, our results are considerably below our expectations (based on our training results). By analyzing the results, we pointed out a crucial difference between training and test videos. In the Violent Scenes Detection task, the participants are instructed on how to extract the DVD data and convert it to MPEG format. For the sake of saving disk space, we opted to convert the MPEG video files to MP4 or to M4V. However, that choice introduced a set of problems.

First, with respect to the training data, we converted the MPEG video files to MP4 or to M4V, depending on the video container for which we were able to successfully synchronize the extracted frames with the numbers given by the groundtruth. Although both containers store the video stream in H.264 format, we did not notice that the M4V conversion resulted in a different video aspect ratio (718×432 pixels). Similarly, the audio encoding was also divergent: MP3 audio for MP4, while AAC audio for M4V. Next, due to frame synchronization issues, we kept the test data in its original format (MPEG-2, 720×576 pixels, with AC3 audio). Therefore, we faced the problem of dealing with different aspect ratios in training and testing data, as well as distinct audio formats.

For the generalization task, the problem is alleviated because the test data is provided in MP4.

Table 3 reports the unofficial (u) results for the main task, which we evaluated ourselves. Here, the results are obtained by using the data (training and test sets) in MPEG format. The first column indicates which input features were used: u1 for 1 type of audio features and u2 for text features. Unfortunately, due to time constraints, we were not able to prepare more runs.

It should first be mentioned that the results for run u2 are independent of the video format, since we directly extracted the movie subtitles from the DVDs. For run u1, we can notice a considerable improvement of classification performance, from 0.315 (run m5) to 0.493 (run u1), confirming the negative impact of using distinct audio formats. We are currently investigating the impact on visual features.

     8mil   Brav   Desp   Ghos   Juma   Term   Vven   MAP
u1   0.351  0.601  0.636  0.530  0.521  0.352  0.463  0.493
u2   0.402  0.237  0.407  0.345  0.232  0.277  0.188  0.298

Table 3: Unofficial results obtained for the main task in terms of MAP2014.

5. ACKNOWLEDGMENTS
This research was partially supported by FAPESP, CAPES, CNPq, and the Project "Capacitação em Tecnologia de Informação" financed by Samsung Eletrônica da Amazônia Ltda., using resources provided by the Informatics Law no. 8.248/91.

6. REFERENCES
[1] FFmpeg. http://www.ffmpeg.org/.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):1–27, 2011.
[3] F. Eyben, M. Wöllmer, and B. Schuller. OpenSmile: The Munich versatile and fast open-source audio feature extractor. In ACM Multimedia, pages 1459–1462, 2010.
[4] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60:91–110, 2004.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014.
[6] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, and C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16–17, 2014.
[7] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (IJCV), 103:60–79, 2013.