=Paper=
{{Paper
|id=Vol-1436/Paper31
|storemode=property
|title=UMons at MediaEval 2015 Affective Impact of Movies Task including Violent Scenes Detection
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper31.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SeddatiKPDMD15
}}
==UMons at MediaEval 2015 Affective Impact of Movies Task including Violent Scenes Detection ==
Omar Seddati (1), Emre Kulah (2), Gueorgui Pironkov (1), Stéphane Dupont (1), Saïd Mahmoudi (1), Thierry Dutoit (1)

(1) University of Mons, Belgium: {omar.seddati, gueorgui.pironkov, stephane.dupont, said.mahmoudi, thierry.dutoit}@umons.ac.be

(2) Middle East Technical University, Ankara, Turkey: emre.kulah@ceng.metu.edu.tr

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

===ABSTRACT===
In this paper, we present the work done at UMons on the MediaEval 2015 Affective Impact of Movies Task (including Violent Scenes Detection). The task is divided into two subtasks: on the one hand, Violent Scene Detection, i.e. automatically finding violent scenes in a set of videos; on the other hand, evaluating the affective impact of a video through an estimation of its valence and arousal. In order to offer a solution for both the detection and classification subtasks, we investigate different visual and auditory feature extraction methods. An i-vector approach is applied to the audio, and optical flow maps processed through a deep convolutional neural network are tested for extracting features from the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.

===1. INTRODUCTION===
With the increasing amount of video content available, the aim of the MediaEval 2015 "Affective Impact of Movies Task" is to show users (depending on their age, preferences or mood) the content they are looking for. More precisely, this year the task focuses on two different aspects.

The first subtask is Violent Scene Detection (VSD), the goal being to alert parents about the potentially violent content of a video. Thus, the criterion used for VSD annotation is: "videos one would not let an 8-year-old child see because of their physical violence". Another possible application could be facilitating video surveillance alerts, as monitoring several screens simultaneously is a complicated task, even for humans.

In addition to VSD, and for the first time at this year's MediaEval workshop, a second subtask is examined: Induced Affect Detection. This subtask focuses on the impact emotions can have on video or movie suggestions. Each video scene is categorized according to its valence class (positive, neutral, negative) and its arousal class (active, neutral, passive). The purpose here is to predict the feelings that a particular video will induce in a user, in order to recommend similar or completely different content.

Both subtasks are examined on the same dataset: around 10,000 video clips from professional and amateur movies, all under Creative Commons license. More information about these subtasks can be found in [6].

===2. APPROACH===
We use the same techniques for the VSD and affect detection subtasks. In our approach, audio and video information are analyzed separately; thus, two different feature extraction methods are applied depending on the modality.

====2.1 Audio approach====
For the audio processing we use the same method as [2], where i-vectors and Probabilistic Linear Discriminant Analysis (pLDA) are used to classify environments (wedding ceremony, birthday party, parade, etc.). The i-vector approach consists of extracting a low-dimensional feature vector from high-dimensional data without losing most of the relevant acoustic information. This method was introduced by the speaker recognition community and has also proven its efficiency in language detection and in speaker adaptation for speech recognition.

In order to extract the i-vectors and classify them with pLDA, we have used the Matlab MSR Identity Toolbox [5]. For each audio track of the video shots, we extract 20 Mel-frequency cepstral coefficients and the associated first and second derivatives. We thus use as input 60-dimensional features with a fixed length of 800 frames for each shot. For each shot, a 100-dimensional i-vector is extracted. All the i-vectors are then processed through three independent classifiers: the first one is trained to classify violent and non-violent scenes, the second one differentiates positive, neutral and negative valence, and the third one is trained on the three levels of arousal.
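For illustration, here is a minimal sketch of the audio front end described above, written in Python with librosa rather than the Matlab MSR Identity Toolbox actually used in this work; how shots are brought to exactly 800 frames is not specified in the paper, so the zero-padding/truncation below is an assumption.

<pre>
import numpy as np
import librosa

def shot_audio_features(wav_path, n_mfcc=20, n_frames=800):
    """60-dim MFCC + delta + delta-delta features per shot, fixed to 800 frames.
    Illustrative sketch only; the paper used the Matlab MSR Identity Toolbox."""
    y, sr = librosa.load(wav_path, sr=None)                    # audio track of one shot
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)     # (20, T)
    d1 = librosa.feature.delta(mfcc)                           # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)                  # second derivatives
    feat = np.vstack([mfcc, d1, d2])                           # (60, T)
    # Fix the length to 800 frames (truncation / zero-padding is an assumption).
    if feat.shape[1] >= n_frames:
        feat = feat[:, :n_frames]
    else:
        feat = np.pad(feat, ((0, 0), (0, n_frames - feat.shape[1])))
    return feat                                                # (60, 800)
</pre>

The 100-dimensional i-vector extraction and the pLDA scoring of the three classifiers are then performed on top of these features, as in [5].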
====2.2 Video approach====
Convolutional neural networks (ConvNets) are a state-of-the-art technique in the field of object recognition within images. ConvNets applied to 2D images are adapted to capture spatial configurations. Using them to capture temporal information related to changes between video frames requires using several frames as input, with the drawback that this significantly increases the dimensionality of the input. An alternative approach therefore consists of using optical flow maps as input, where each map represents the motion of each pixel between two successive frames.

We used the TV-L1 algorithm [4] from the OpenCV toolbox for optical flow extraction, and we use 10 stacked optical flow frames as input. Note that 10 stacked optical flows correspond to 20 maps, given that both the horizontal and vertical components have to be provided. In order to reduce overfitting, we use dropout as well as data augmentation by randomly cropping and flipping the maps of the input sequence. We also estimate the motion of the camera by calculating the mean across the maps of the same component (horizontal and vertical), and then subtract the corresponding mean. Our system is built on the publicly available Torch toolbox [1], which offers a powerful and varied set of tools, especially for building and training ConvNets. The details of the architecture we used are listed in Table 1.

Using dense optical flow maps means that the size of the neural network increases rapidly with the length of the sequence used as input. This implies that short sub-sequences of video frames (or rather of optical flow maps) have to be used as input to the ConvNet, which increases the risk that those sub-sequences fall on parts of the video where there is no useful information for identifying the category. To tackle this problem, we use a sliding window approach at test time, estimating the probability of each category on several sub-sequences of the video. The class with the highest probability after averaging over all the sub-sequence probabilities is selected as the most likely class.

We also train a ConvNet with the same architecture on the HMDB-51 dataset [3] (an action recognition benchmark), in order to build a more robust motion feature extractor leveraging this additional external data. We then extract features from the annotated MediaEval data and train a two-layer fully connected neural network for each of the three subtasks.

{| class="wikitable"
|+ Table 1: ConvNet architecture
! Ind !! Type !! Filter size !! Filter num !! Stride
|-
| 1 || Conv || 7x7 || 32 || 2
|-
| 2 || ReLU || - || - || -
|-
| 3 || Maxpool || 3x3 || - || 2
|-
| 4 || Normalization || - || - || -
|-
| 5 || Conv || 5x5 || 96 || 1
|-
| 6 || ReLU || - || - || -
|-
| 7 || Maxpool || 3x3 || - || 2
|-
| 8 || Normalization || - || - || -
|-
| 9 || Conv || 3x3 || 96 || 1
|-
| 10 || ReLU || - || - || -
|-
| 11 || Maxpool || 3x3 || - || 2
|-
| 12 || Conv || 3x3 || 96 || 1
|-
| 13 || ReLU || - || - || -
|-
| 14 || Maxpool || 3x3 || - || 2
|-
| 15 || Conv || 3x3 || 96 || 1
|-
| 16 || ReLU || - || - || -
|-
| 17 || FC || - || 1024 || -
|-
| 18 || Dropout || - || - || -
|-
| 19 || ReLU || - || - || -
|-
| 20 || FC || - || 512 || -
|-
| 21 || Dropout || - || - || -
|-
| 22 || ReLU || - || - || -
|-
| 23 || FC || - || 2 or 3 || -
|-
| 24 || LogSoftMax || - || - || -
|}
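As a reading aid, the Table 1 architecture can be transcribed into a PyTorch nn.Sequential (the original system was implemented with Torch7 [1]). Several details below are assumptions not stated in the paper: the "Normalization" rows are interpreted as local response normalization, padding and the dropout rate are left at library defaults, the input is taken to be the 20 stacked flow maps of Section 2.2, and a lazy linear layer stands in for the first FC layer because the input spatial resolution is not reported.

<pre>
import torch.nn as nn

def flow_convnet(n_classes: int, in_maps: int = 20) -> nn.Sequential:
    """Sketch of the Table 1 ConvNet (assumptions noted in the surrounding text)."""
    return nn.Sequential(
        nn.Conv2d(in_maps, 32, kernel_size=7, stride=2), nn.ReLU(),   # rows 1-2
        nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),           # rows 3-4
        nn.Conv2d(32, 96, kernel_size=5, stride=1), nn.ReLU(),        # rows 5-6
        nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),           # rows 7-8
        nn.Conv2d(96, 96, kernel_size=3, stride=1), nn.ReLU(),        # rows 9-10
        nn.MaxPool2d(3, stride=2),                                    # row 11
        nn.Conv2d(96, 96, kernel_size=3, stride=1), nn.ReLU(),        # rows 12-13
        nn.MaxPool2d(3, stride=2),                                    # row 14
        nn.Conv2d(96, 96, kernel_size=3, stride=1), nn.ReLU(),        # rows 15-16
        nn.Flatten(),
        nn.LazyLinear(1024), nn.Dropout(), nn.ReLU(),                 # rows 17-19
        nn.Linear(1024, 512), nn.Dropout(), nn.ReLU(),                # rows 20-22
        nn.Linear(512, n_classes),   # 2 classes for VSD, 3 for valence or arousal
        nn.LogSoftmax(dim=1),                                         # row 24
    )

# Example (input resolution is assumed, the paper does not report it):
# net = flow_convnet(n_classes=2); out = net(torch.rand(4, 20, 224, 224))
</pre>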
===3. RESULTS AND DISCUSSION===
We have submitted three runs for both subtasks. The results for the VSD subtask are presented in Table 2, where the Mean Average Precision (MAP) is computed for each run. We can see that using external data from HMDB-51 to train the feature extractor is less efficient than training the feature extractor on the MediaEval dataset. The i-vector and pLDA technique gives results similar to those of the optical flow maps and ConvNets combination.

{| class="wikitable"
|+ Table 2: Mean Average Precision (MAP) on Violence detection
! Run !! MAP (%)
|-
| i-vector - pLDA || 9.56
|-
| optical flow maps - ConvNets || 9.67
|-
| optical flow maps - ConvNets - HMDB-51 || 6.56
|}

The global accuracy for the affect detection task is shown in Table 3. For valence, all methods give similar results. A difference appears for the arousal task: the audio features perform poorly in comparison to the other runs, and using external data proves here to be more interesting, as the last run significantly outperforms the second run. Motion seems to be an important discriminative factor for arousal estimation.

{| class="wikitable"
|+ Table 3: Global accuracy for Affect detection (OFM stands for optical flow maps)
! Run !! Valence (%) !! Arousal (%)
|-
| i-vector - pLDA || 37.03 || 31.71
|-
| OFM - ConvNets || 35.28 || 44.39
|-
| OFM - ConvNets - HMDB-51 || 37.28 || 52.44
|}

====3.1 Discussion====
We have also investigated merging the audio and visual features: the features from the ConvNets extractor and the i-vectors were used together as input to another neural network. However, the results were poorer than when using the features separately. Further work will investigate audio-visual fusion more in depth.
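To make the fusion experiment of Section 3.1 concrete, here is a minimal sketch under the assumption that the per-shot 100-dimensional i-vector is simply concatenated with the ConvNet feature vector (taken here to be the 512-dimensional penultimate layer) before a small feed-forward classifier; the hidden size and the exact fusion scheme are not given in the paper and are hypothetical.

<pre>
import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    """Hypothetical feature-level fusion: concatenate the i-vector with the
    ConvNet motion features and classify with a small feed-forward network
    (dimensions and hidden size are assumptions, not taken from the paper)."""
    def __init__(self, ivec_dim=100, conv_dim=512, n_classes=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ivec_dim + conv_dim, hidden), nn.ReLU(), nn.Dropout(),
            nn.Linear(hidden, n_classes), nn.LogSoftmax(dim=1),
        )

    def forward(self, ivec, conv_feat):
        return self.net(torch.cat([ivec, conv_feat], dim=1))

# Usage on a batch of 8 shots (random tensors as stand-ins for real features):
fusion = FeatureFusionNet()
log_probs = fusion(torch.randn(8, 100), torch.randn(8, 512))
</pre>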
===4. CONCLUSION===
In this paper we presented two approaches for both affect detection and violent scene detection, in which visual and audio features are processed separately. Both types of features give similar results for violence detection and valence. For arousal, the video features are far more effective, especially when the ConvNets feature extractor is trained on external data. Our future work will focus on merging the audio and video features.

===5. ACKNOWLEDGEMENTS===
This work has been partly funded by the Walloon Region of Belgium through the Chist-Era IMOTION project (Intelligent Multi-Modal Augmented Video Motion Retrieval System) and by the European Regional Development Fund (ERDF) through the DigiSTORM project.

===6. REFERENCES===
[1] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.

[2] B. Elizalde, H. Lei, and G. Friedland. An i-vector representation of acoustic environments for audio-based video event detection on user generated content. In 2013 IEEE International Symposium on Multimedia (ISM), pages 114–117. IEEE, 2013.

[3] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 2556–2563. IEEE, 2011.

[4] J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo. TV-L1 optical flow estimation. Image Processing On Line, 2013:137–150, 2013.

[5] S. O. Sadjadi, M. Slaney, and L. Heck. MSR Identity Toolbox v1.0: A MATLAB toolbox for speaker recognition research. Speech and Language Processing Technical Committee Newsletter, 2013.

[6] M. Sjöberg, B. Ionescu, H. Wang, Y. Baveye, E. Dellandréa, L. Chen, V. L. Quang, M. Schedl, and C.-H. Demarty. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.