<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UMons at MediaEval 2015 Affective Impact of Movies Task including Violent Scenes Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Omar Seddati</string-name>
          <email>omar.seddati@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emre Kulah</string-name>
          <email>emre.kulah@ceng.metu.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gueorgui Pironkov</string-name>
          <email>gueorgui.pironkov@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Dupont</string-name>
          <email>stephane.dupont@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saïd Mahmoudi</string-name>
          <email>said.mahmoudi@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Dutoit</string-name>
          <email>thierry.dutoit@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Middle East Technical University</institution>
          ,
          <addr-line>Ankara</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mons</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present the work done at UMons regarding the MediaEval 2015 Affective Impact of Movies Task (including Violent Scenes Detection). This task can be divided into two subtasks: on the one hand, Violent Scene Detection, i.e. automatically finding scenes that are violent in a set of videos; on the other hand, evaluating the affective impact of the video through an estimation of its valence and arousal. In order to offer a solution for both the detection and classification subtasks, we investigate different visual and auditory feature extraction methods. An i-vector approach is applied for the audio, and optical flow maps processed through a deep convolutional neural network are tested for extracting features from the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>With the increasing amount of video content available, the
aim of the MediaEval 2015 "Affective Impact of Movies Task" is
to show users (depending on their age, preferences or mood)
the content they are looking for. More precisely, this year
the task focuses on two different aspects.</p>
      <p>The first subtask is Violent Scene Detection (VSD), the
goal being to alert parents about the potentially violent
content of a video. Thus, the criterion used for VSD
annotation is: "videos one would not let an 8 years old child
see because of their physical violence". Another possible
application could be facilitating video surveillance alerts, as
monitoring several screens simultaneously is a complicated
task, even for humans.</p>
      <p>
        In addition to VSD, and for the first time at this year's
MediaEval workshop, a second subtask is examined:
Induced Affect Detection. This subtask focuses on the impact
emotions can have for video or movie suggestions. Each
video scene is categorized depending on its valence class
(positive - neutral - negative) and its arousal class (active -
neutral - passive). The purpose here is to predict the feelings
that a particular video will elicit in a user, in order to
recommend similar or completely different content.
Both subtasks are examined on the same dataset: around
10,000 video clips from professional and amateur movies,
all under Creative Commons license. More
information about these subtasks can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>We use the same techniques for the VSD and affect
detection subtasks. In our approach, audio and video information
are analyzed separately; thus, two different feature
extraction methods are applied, one per modality.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Audio approach</title>
      <p>
        For the audio processing we use the same method as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
where i-vectors and Probabilistic Linear Discriminant
Analysis (pLDA) are used to classify environments (wedding
ceremony, birthday party, parade, etc.). The i-vector approach
consists of extracting a low-dimensional feature vector from
high-dimensional data without losing most of the relevant
acoustic information. This method was introduced by the
speaker recognition community and has also proven its
efficiency in language detection and in speaker adaptation for
speech recognition.
      </p>
      <p>
        In order to extract the i-vectors and classify them through
pLDA, we have used the Matlab MSR Identity Toolbox [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
For each audio track of the video shots, we extract 20
Mel-frequency cepstral coefficients, and the associated first and
second derivatives. Thus, we use as input 60-dimensional
features with a fixed length of 800 frames for each shot.
For each shot a 100-dimensional i-vector is extracted. All
the i-vectors are then processed through three independent
classifiers. The first one is trained to classify violent and
non-violent scenes. The second one differentiates positive,
neutral and negative valence. The third one is trained on
the three different levels of arousal.
      </p>
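      <p>As an illustration, the following minimal sketch reproduces the
acoustic front end described above (20 MFCCs plus first and second
derivatives, fixed to 800 frames per shot). It uses the Python library
librosa rather than the Matlab toolbox actually employed, so exact
coefficient values will differ; the file name and sampling rate are
assumptions.</p>
      <preformat>
import numpy as np
import librosa

def shot_features(wav_path, n_frames=800):
    """60-dim MFCC + delta + delta-delta features, fixed to n_frames."""
    y, sr = librosa.load(wav_path, sr=16000)             # assumed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # 20 x T
    d1 = librosa.feature.delta(mfcc)                     # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)            # second derivatives
    feats = np.vstack([mfcc, d1, d2])                    # 60 x T
    # Fix the length to n_frames per shot, as described in the paper:
    # zero-pad short shots, truncate long ones.
    pad = max(0, n_frames - feats.shape[1])
    feats = np.pad(feats, ((0, 0), (0, pad)))[:, :n_frames]
    return feats                                         # 60 x 800

feats = shot_features("shot_0001.wav")                   # hypothetical file
      </preformat>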
    </sec>
    <sec id="sec-4">
      <title>2.2 Video approach</title>
      <p>Convolutional neural networks (ConvNets) are a
state-of-the-art technique in the field of object recognition within
images. ConvNets applied to 2D images are adapted to capture
spatial configurations. Using them to capture temporal
information related to changes between video frames requires
using several frames as input. A drawback is that this
significantly increases the dimensionality of the input. Thus, an
alternative approach consists of using optical flow maps as
input. Each map represents the motion of each pixel
between two successive frames.</p>
      <p>
        We used the TV-L1 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] algorithm from the OpenCV toolbox
for optical flow extraction. We use 10 stacked optical flow
frames as input. Note that 10 stacked optical flows equal
20 maps, given that both horizontal and vertical components
have to be provided. In order to reduce overfitting we use
dropout, as well as data augmentation by randomly cropping
and flipping the maps of the input sequence. We also
estimate the motion of the camera by calculating the mean
across the maps of the same component (horizontal and
vertical), then we subtract the corresponding mean. Our
system is implemented using the publicly available Torch toolbox [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
which offers a powerful and varied set of tools, especially for
building and training ConvNets. The details of the
architecture used are listed in Table 1.
      </p>
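      <p>A minimal sketch of this input pipeline, in Python with the
opencv-contrib package, is given below: it stacks 10 TV-L1 flows into
a 20-map input and applies the camera-motion compensation by mean
subtraction. Subtracting a per-component scalar mean is one reading of
the description above, and the function name is ours.</p>
      <preformat>
import numpy as np
import cv2

# TV-L1 optical flow; lives in the opencv-contrib package in recent
# OpenCV releases (older releases expose cv2.DualTVL1OpticalFlow_create).
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def flow_stack(gray_frames):
    """Stack 10 consecutive TV-L1 flows into a 20-channel input."""
    maps = []
    for prev, nxt in zip(gray_frames[:10], gray_frames[1:11]):
        flow = tvl1.calc(prev, nxt, None)       # H x W x 2 (dx, dy)
        maps.extend([flow[..., 0], flow[..., 1]])
    stack = np.stack(maps)                      # 20 x H x W
    # Crude camera-motion compensation (one reading of the paper):
    # subtract, per component, the mean displacement over the stack.
    stack[0::2] -= stack[0::2].mean()           # horizontal maps
    stack[1::2] -= stack[1::2].mean()           # vertical maps
    return stack
      </preformat>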
      <p>Using dense optical flow maps means that the size of the
neural network increases rapidly with the length of the
sequence used as input. This implies that short sub-sequences
of video frames (or rather optical flow maps) have to be
used as input to the ConvNet. This increases the risk that
those sub-sequences fall on parts of the video where there is
no useful information for the identification of the category.
To tackle this problem, we use a sliding window approach
at test time, estimating the probability for each category
in several sub-sequences of the video. The class with the
highest probability after averaging over all the different
sub-sequence probabilities is selected as the most likely class.</p>
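      <p>Such a sliding-window prediction could look as follows;
<monospace>convnet_probs</monospace> is a hypothetical stand-in for the
trained ConvNet, and the window stride is an assumption.</p>
      <preformat>
import numpy as np

def classify_video(flow_maps, convnet_probs, win=10, stride=5):
    """Average class probabilities over all sub-sequences, pick the argmax.

    flow_maps: T x H x W x 2 array of optical flows for the whole shot.
    convnet_probs: hypothetical callable mapping a 20-channel window
    to a vector of class probabilities (the trained ConvNet).
    """
    probs = []
    for t in range(0, flow_maps.shape[0] - win + 1, stride):
        window = flow_maps[t:t + win]                     # win x H x W x 2
        channels = window.transpose(0, 3, 1, 2).reshape(
            2 * win, *window.shape[1:3])                  # 20 x H x W
        probs.append(convnet_probs(channels))
    # Class with the highest averaged probability is the prediction.
    return int(np.argmax(np.mean(probs, axis=0)))
      </preformat>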
      <p>
        We also train a ConvNet with the same architecture on
the HMDB-51 dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (action recognition benchmark),
in order to build a more robust motion feature extractor
leveraging this additional external data. Then, we extract
features from the MediaEval annotated data and train a
two-layer fully connected feed-forward neural network on
those features.
      </p>
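      <p>The paper's networks were built in Torch; the sketch below
re-expresses such a two-layer fully connected classifier with dropout
in present-day PyTorch purely for illustration. The feature dimension,
hidden width and dropout rate are assumptions, not values from the
paper.</p>
      <preformat>
import torch
import torch.nn as nn

# Hypothetical dimensions: the size of the HMDB-pretrained feature
# vector and the hidden width are assumptions.
FEAT_DIM, HIDDEN, N_CLASSES = 4096, 512, 2    # 2 classes for VSD

classifier = nn.Sequential(
    nn.Linear(FEAT_DIM, HIDDEN),
    nn.ReLU(),
    nn.Dropout(0.5),                          # dropout, as in the paper
    nn.Linear(HIDDEN, N_CLASSES),
)

def train_step(features, labels, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """One gradient step on a batch of pre-extracted ConvNet features."""
    optimizer.zero_grad()
    loss = loss_fn(classifier(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
      </preformat>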
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>We have submitted three runs for both subtasks. The
results for the VSD task are presented in Table 2. The Mean
Average Precision (MAP) is computed for each run. We can
see that using external data from HMDB to train
the feature extractor is less effective than training the feature
extractor on the MediaEval dataset. The i-vector &amp; pLDA
technique gives results similar to those of the optical flow maps &amp;
ConvNets combination.</p>
      <p>The global accuracy for the affect detection task is shown
in Table 3. For valence, all methods give similar results.
A difference appears for the arousal task: the audio features
perform poorly in comparison to the other runs. Using
external data proves more beneficial here, as the last run
significantly outperforms the second run. Motion seems to
be an important discriminative factor for arousal estimation.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Discussion</title>
      <p>We have also investigated merging the audio and visual
features together. The features from the ConvNets
extractor and the i-vectors were used as input to another neural
network, but the results were poorer than when using the features
separately. Further work will investigate audio-visual fusion
more in depth.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION</title>
      <p>In this paper we presented two approaches for both affect
and violent scene detection. Visual and audio features are
processed separately. Both feature types give similar
results for violence detection and valence. For arousal, video
features are far more informative, especially when the
ConvNets feature extractor is trained on external data. Our
future work will focus on merging the audio and video
features.</p>
    </sec>
    <sec id="sec-8">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This work has been partly funded by the Walloon Region
of Belgium through the Chist-Era IMOTION project
(Intelligent Multi-Modal Augmented Video Motion Retrieval
System) and by the European Regional Development Fund
(ERDF) through the DigiSTORM project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Farabet</surname>
          </string-name>
          .
          <article-title>Torch7: A matlab-like environment for machine learning</article-title>
          .
          <source>In BigLearn, NIPS Workshop, number EPFL-CONF-192376</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Elizalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          .
          <article-title>An i-vector representation of acoustic environments for audio-based video event detection on user generated content</article-title>
          .
          <source>In Multimedia (ISM)</source>
          ,
          <source>2013 IEEE International Symposium on</source>
          , pages
          <fpage>114</fpage>
          –
          <lpage>117</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kuehne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Garrote</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Serre</surname>
          </string-name>
          .
          <article-title>HMDB: a large video database for human motion recognition</article-title>
          .
          <source>In Computer Vision</source>
          (ICCV),
          <year>2011</year>
          IEEE International Conference on, pages
          <fpage>2556</fpage>
          –
          <lpage>2563</lpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Meinhardt-Llopis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Facciolo</surname>
          </string-name>
          .
          <article-title>TV-L1 optical flow estimation</article-title>
          .
          <source>Image Processing On Line</source>
          ,
          <year>2013</year>
          :
          <fpage>137</fpage>
          –
          <lpage>150</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Sadjadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Slaney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <article-title>MSR Identity Toolbox v1.0: A MATLAB toolbox for speaker recognition research</article-title>
          .
          <source>Speech and Language Processing Technical Committee Newsletter</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          .
          <article-title>The MediaEval 2015 affective impact of movies task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>