 UMons at MediaEval 2015 Affective Impact of Movies Task
           including Violent Scenes Detection

                  Omar Seddati1, Emre Kulah2, Gueorgui Pironkov1, Stéphane Dupont1,
                                    Saïd Mahmoudi1, Thierry Dutoit1

                                    1 University of Mons, Belgium
               {omar.seddati, gueorgui.pironkov, stephane.dupont, said.mahmoudi, thierry.dutoit}@umons.ac.be

                        2 Middle East Technical University, Ankara, Turkey
                                     emre.kulah@ceng.metu.edu.tr



ABSTRACT
In this paper, we present the work done at UMons regarding the MediaEval
2015 Affective Impact of Movies Task (including Violent Scenes Detection).
This task can be divided into two subtasks. On the one hand, Violent Scene
Detection, which means automatically finding scenes that are violent in a
set of videos. On the other hand, evaluating the affective impact of the
video through an estimation of valence and arousal. In order to offer a
solution for both the detection and classification subtasks, we investigate
different visual and auditory feature extraction methods. An i-vector
approach is applied to the audio, and optical flow maps processed through a
deep convolutional neural network are tested for extracting features from
the video. Classifiers based on probabilistic linear discriminant analysis
and fully connected feed-forward neural networks are then used.

1.   INTRODUCTION
   With the increasing amount of video content available, the aim of the
MediaEval 2015 "Affective Impact of Movies Task" is to show users (depending
on their age, preferences or mood) the content they are looking for. More
precisely, this year the task focuses on two different aspects.
   The first subtask is Violent Scene Detection (VSD), the goal being to
alert parents about the potentially violent content of a video. Thus, the
criterion used for VSD annotation is: "videos one would not let an 8 years
old child see because of their physical violence". Another possible
application could be facilitating video surveillance alerts, as monitoring
several screens simultaneously is a complicated task, even for humans.
   In addition to VSD, and for the first time at this year's MediaEval
workshop, a second subtask is examined: Induced Affect Detection. This
subtask focuses on the impact emotions can have for video or movie
suggestions. Each video scene is categorized according to its valence class
(positive - neutral - negative) and its arousal class (active - neutral -
passive). The purpose here is to predict the feelings that a particular
video will induce in a user, in order to recommend similar or completely
different content.
   Both subtasks are evaluated on the same dataset. Around 10,000 video
clips from professional and amateur movies are used, all under Creative
Commons license. More information about these subtasks can be found in [6].

2.   APPROACH
   We use the same techniques for the VSD and affect detection subtasks. In
our approach, audio and video information are analyzed separately. Thus,
two different feature extraction methods are applied, one per modality.

2.1   Audio approach
   For the audio processing we use the same method as [2], where i-vectors
and Probabilistic Linear Discriminant Analysis (pLDA) are used to classify
environments (wedding ceremony, birthday party, parade, etc.). The i-vector
approach consists of extracting a low-dimensional feature vector from
high-dimensional data without losing most of the relevant acoustic
information. This method was introduced by the speaker recognition
community and has also proven its efficiency in language detection and in
speaker adaptation for speech recognition.
   In order to extract the i-vectors and classify them through pLDA, we
used the Matlab MSR Identity Toolbox [5]. For each audio track of the video
shots, we extract 20 Mel-frequency cepstral coefficients and their
associated first and second derivatives. Thus, we use as input
60-dimensional features with a fixed length of 800 frames for each shot.
For each shot, a 100-dimensional i-vector is extracted. All the i-vectors
are then processed through three independent classifiers. The first one is
trained to classify violent and non-violent scenes. The second one
differentiates positive, neutral and negative valence. The third one is
trained on the three different levels of arousal.
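   As a concrete illustration of this audio front-end only (the i-vector
extraction and pLDA scoring themselves are done with the Matlab MSR
Identity Toolbox [5] and are not reproduced here), the 60-dimensional shot
features described above could be computed along the following lines. This
is a sketch under our own assumptions, not the code actually used: the
librosa-based helper, its name, and the availability of each shot's audio
track as a wav file are illustrative.

```python
import numpy as np
import librosa  # illustrative only; the paper's back-end is the Matlab MSR Identity Toolbox [5]


def shot_features(wav_path, n_mfcc=20, n_frames=800):
    """Return a (60, n_frames) matrix: 20 MFCCs plus their first and second
    derivatives, truncated or zero-padded to a fixed 800 frames per shot."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])  # 60 x T
    if feats.shape[1] >= n_frames:
        return feats[:, :n_frames]
    return np.pad(feats, ((0, 0), (0, n_frames - feats.shape[1])))
```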
2.2   Video approach
   Convolutional neural networks (ConvNets) are a state-of-the-art
technique in the field of object recognition within images. ConvNets
applied to 2D images are adapted to capture spatial configurations. Using
them to capture temporal information related to changes between video
frames requires using several frames as input. A drawback is that this
significantly increases the dimensionality of the input. Thus, an
alternative approach consists of using optical flow maps as input. Each map
represents the motion of each pixel between two successive frames.
   We used the TV-L1 algorithm [4] from the OpenCV toolbox for optical flow
extraction. We use 10 stacked optical flows as input. Note that 10 stacked
optical flows correspond to 20 maps, given that both the horizontal and
vertical components have to be provided. In order to reduce overfitting, we
use dropout, as well as data augmentation by randomly cropping and flipping
the maps of the input sequence. We also compensate for camera motion by
calculating the mean across the maps of the same component (horizontal and
vertical) and subtracting the corresponding mean. Our system is built with
the publicly available Torch toolbox [1], which offers a powerful and
varied set of tools, especially for building and training ConvNets. The
details of the architecture we used are listed in Table 1; a sketch of the
optical flow preprocessing is given below.
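   The exact preprocessing code is not given in the paper; the following
Python sketch only illustrates the steps described above (TV-L1 flow with
OpenCV, per-component mean subtraction for camera motion, random crop and
flip for augmentation). The function names and the 112-pixel crop size are
our own assumptions.

```python
import random

import cv2  # the TV-L1 implementation requires the opencv-contrib-python build
import numpy as np


def flow_stack(frames):
    """Turn a list of consecutive grayscale frames into a stack of TV-L1 flow
    maps (horizontal, vertical, horizontal, ...), with the per-component mean
    subtracted as a simple camera-motion compensation."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    maps = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = tvl1.calc(prev, nxt, None)        # H x W x 2
        for c in range(2):
            comp = flow[:, :, c]
            maps.append(comp - comp.mean())      # remove global (camera) motion
    return np.stack(maps).astype(np.float32)     # e.g. 20 maps for 11 frames


def augment(stack, crop=112):
    """Random crop and horizontal flip of a (C, H, W) flow stack (training only).
    Even channels hold horizontal flow, so a flip also negates their sign."""
    _, h, w = stack.shape
    y, x = random.randint(0, h - crop), random.randint(0, w - crop)
    out = stack[:, y:y + crop, x:x + crop]
    if random.random() < 0.5:
        out = out[:, :, ::-1].copy()
        out[0::2] *= -1.0
    return out
```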
                 Table 1: ConvNet architecture
 Ind   Type            Filter size   Filter num   Stride
 1     Conv            7x7           32           2
 2     ReLU            -             -            -
 3     Maxpool         3x3           -            2
 4     Normalization   -             -            -
 5     Conv            5x5           96           1
 6     ReLU            -             -            -
 7     Maxpool         3x3           -            2
 8     Normalization   -             -            -
 9     Conv            3x3           96           1
 10    ReLU            -             -            -
 11    Maxpool         3x3           -            2
 12    Conv            3x3           96           1
 13    ReLU            -             -            -
 14    Maxpool         3x3           -            2
 15    Conv            3x3           96           1
 16    ReLU            -             -            -
 17    FC              -             1024         -
 18    Dropout         -             -            -
 19    ReLU            -             -            -
 20    FC              -             512          -
 21    Dropout         -             -            -
 22    ReLU            -             -            -
 23    FC              -             2 or 3       -
 24    LogSoftMax      -             -            -
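   The network itself was implemented in Torch7 [1]; as an unofficial
transcription of Table 1, a roughly equivalent PyTorch definition is
sketched below. The input is the stack of 20 flow maps, the "Normalization"
rows are assumed to be local response normalization, and the last layer has
2 outputs for VSD or 3 for valence and arousal.

```python
import torch.nn as nn


def convnet(n_classes=3, in_maps=20):
    """Unofficial PyTorch rendering of the Torch7 architecture in Table 1."""
    return nn.Sequential(
        nn.Conv2d(in_maps, 32, kernel_size=7, stride=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.LocalResponseNorm(5),
        nn.Conv2d(32, 96, kernel_size=5, stride=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.LocalResponseNorm(5),
        nn.Conv2d(96, 96, kernel_size=3, stride=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 96, kernel_size=3, stride=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 96, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(1024), nn.Dropout(), nn.ReLU(),   # FC 1024
        nn.Linear(1024, 512), nn.Dropout(), nn.ReLU(),  # FC 512
        nn.Linear(512, n_classes),
        nn.LogSoftmax(dim=1),
    )
```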
   Using dense optical flow maps means that the size of the neural network
increases rapidly with the length of the sequence used as input. This
implies that short sub-sequences of video frames (or rather of optical flow
maps) have to be used as input to the ConvNet. This increases the risk that
those sub-sequences fall on parts of the video where there is no useful
information for the identification of the category. To tackle this problem,
we use a sliding window approach at test time, estimating the probability
of each category on several sub-sequences of the video. The class with the
highest probability after averaging over all the sub-sequence probabilities
is selected as the most likely class.
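   A minimal sketch of this test-time procedure, assuming a model such as
the one above that maps a 20-map stack to per-class log-probabilities (the
window step is an illustrative choice, not a value from the paper):

```python
import torch


def classify_shot(model, flow_maps, window=20, step=2):
    """Slide a fixed-size window over the shot's flow maps, average the class
    probabilities of all windows and return the most likely class.
    A step of 2 maps keeps the (horizontal, vertical) pairs aligned."""
    model.eval()
    probs = []
    with torch.no_grad():
        for start in range(0, flow_maps.shape[0] - window + 1, step):
            x = flow_maps[start:start + window].unsqueeze(0)  # 1 x 20 x H x W
            probs.append(model(x).exp())                      # LogSoftmax -> probabilities
    return torch.cat(probs).mean(dim=0).argmax().item()
```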
   We also train a ConvNet with the same architecture on the HMDB-51
dataset [3] (an action recognition benchmark), in order to build a more
robust motion feature extractor leveraging this additional external data.
Then, we extract features from the MediaEval annotated data and train a
two-layer fully connected neural network for each of the three subtasks.
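   The paper does not detail this classifier beyond "two-layer fully
connected"; a plausible rendering, with the hidden size chosen arbitrarily
for illustration, is:

```python
import torch.nn as nn


def subtask_classifier(feat_dim, n_classes, hidden=512):
    """Two-layer fully connected classifier trained on features extracted by
    the HMDB-51-pretrained ConvNet (hidden size is an assumption)."""
    return nn.Sequential(
        nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(),
        nn.Linear(hidden, n_classes), nn.LogSoftmax(dim=1),
    )
```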
3.   RESULTS AND DISCUSSION
   We have submitted three runs for both subtasks. The results for the VSD
subtask are presented in Table 2; the Mean Average Precision (MAP) is
computed for each run. We can see that using external data from HMDB-51 to
train the feature extractor is less efficient than training the feature
extractor on the MediaEval dataset. The i-vector & pLDA technique gives
results similar to the optical flow maps & ConvNets combination.

      Table 2: Mean Average Precision (MAP) on Violence detection
 Run                                       MAP (%)
 i-vector - pLDA                           9.56
 optical flow maps - ConvNets              9.67
 optical flow maps - ConvNets - HMDB-51    6.56

   The global accuracy for the affect detection task is shown in Table 3.
For valence, all methods give similar results. A difference appears for the
arousal task: the audio features perform poorly in comparison to the other
runs, and using external data proves more interesting here, as the last run
significantly outperforms the second run. Motion seems to be an important
discriminative factor for arousal estimation.

      Table 3: Global accuracy for Affect detection
      (OFM stands for optical flow maps)
 Run                          Valence (%)   Arousal (%)
 i-vector - pLDA              37.03         31.71
 OFM - ConvNets               35.28         44.39
 OFM - ConvNets - HMDB-51     37.28         52.44

3.1   Discussion
   We have also investigated merging the audio and visual features. The
features from the ConvNet extractor and the i-vectors were used as input to
another neural network, but the results were poorer than using the features
separately. Further work will investigate audio-visual fusion more in
depth.

4.   CONCLUSION
   In this paper we presented two approaches for both affect and violent
scene detection. Visual and audio features are processed separately. Both
feature types give similar results for violence detection and valence. For
arousal, video features are far more interesting, especially when the
ConvNet feature extractor is trained on external data. Our future work will
focus on merging the audio and video features.

5.   ACKNOWLEDGEMENTS
   This work has been partly funded by the Walloon Region of Belgium
through the Chist-Era IMOTION project (Intelligent Multi-Modal Augmented
Video Motion Retrieval System) and by the European Regional Development
Fund (ERDF) through the DigiSTORM project.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany
6.   REFERENCES
[1] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7:
    A matlab-like environment for machine learning. In
    BigLearn, NIPS Workshop, number
    EPFL-CONF-192376, 2011.
[2] B. Elizalde, H. Lei, and G. Friedland. An i-vector
    representation of acoustic environments for audio-based
    video event detection on user generated content. In
    Multimedia (ISM), 2013 IEEE International
    Symposium on, pages 114–117. IEEE, 2013.
[3] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and
    T. Serre. HMDB: a large video database for human
    motion recognition. In Computer Vision (ICCV), 2011
    IEEE International Conference on, pages 2556–2563.
    IEEE, 2011.
[4] J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo.
    TV-L1 optical flow estimation. Image Processing On
    Line, 2013:137–150, 2013.
[5] S. O. Sadjadi, M. Slaney, and L. Heck. MSR Identity
    Toolbox v1.0: A MATLAB toolbox for speaker
    recognition research. Speech and Language Processing
    Technical Committee Newsletter, 2013.
[6] M. Sjöberg, B. Ionescu, H. Wang, Y. Baveye,
    E. Dellandréa, L. Chen, V. L. Quang, M. Schedl, and
    C.-H. Demarty. The MediaEval 2015 affective impact of
    movies task. In MediaEval 2015 Workshop, Wurzen,
    Germany, 2015.