<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NII-UIT at MediaEval 2015 Affective Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vu Lam</string-name>
          <email>lqvu@fit.hcmus.edu.vn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sang Phan</string-name>
          <email>plsang@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duy-Dinh Le</string-name>
          <email>ledduy@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shin'ichi Satoh</string-name>
          <email>satoh@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duc Anh Duong</string-name>
          <email>ducda@uit.edu.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Informatics</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Affective Impact of Movies task aims to detect violent videos and the affective impact of those videos on viewers [9]. This is a challenging task not only because of the diversity of video content but also due to the subjectiveness of human emotion. In this paper, we present a unified framework that can be applied to both subtasks: (i) induced affect detection, and (ii) violence detection. This framework is based on our previous year's Violent Scene Detection (VSD) framework. We extended it to support affect detection by training different valence/arousal classes independently and combining them to make the final decision. Besides using internal features from three different modalities (audio, image, and motion), this year we also incorporate deep learning features into our framework. Experimental results show that our unified framework can detect violent videos and their affective impact with reasonable accuracy. Moreover, using deep features can significantly improve the detection performance of both subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Detecting the affective impact of movies requires combining
multimedia features. For example, a violent car-chase video
can be detected by searching for evidence such as
fast-moving cars or the sound of gunshots.
To this end, we have developed a framework that supports
combining features from multiple modalities for violent scene
detection. We consider induced affect detection as a
multi-class classification task, so our framework can
be applied to predict the valence and arousal class of a video
as well. In general, our framework consists of three main
components: feature extraction, feature encoding, and feature
classification. An overview of our framework is shown in
Fig. 1.</p>
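      <p>As a rough illustration of this structure (hypothetical function names; a sketch rather than the actual implementation), a single-modality pipeline can be written as follows.</p>
      <preformat><![CDATA[
# Hypothetical sketch of the three-component pipeline:
# feature extraction -> feature encoding -> feature classification.
def score_video(video, extract_descriptors, encode_fisher_vector, svm_model):
    descriptors = extract_descriptors(video)           # e.g. SIFT, MFCC, or trajectory descriptors
    video_vector = encode_fisher_vector(descriptors)   # fixed-length vector per video
    return svm_model.decision_function([video_vector])[0]  # detector score for this video
]]></preformat>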
    </sec>
    <sec id="sec-2">
      <title>2. FEATURE EXTRACTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Image Features</title>
      <p>
        First, we scale the original video to 320×240 pixels
and sample frames from the video every second. We
use the standard SIFT feature with the Hessian-Laplace interest
point detector to extract features from each frame [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Each
frame is represented using the Fisher Vector encoding [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
We use the average pooling strategy to aggregate the
frame-based features into the final video representation, which has
40,960 dimensions.
      </p>
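      <p>A minimal sketch of the average pooling step follows; the signed square-root and L2 normalization shown is common Fisher-vector post-processing and is an assumption, not stated above.</p>
      <preformat><![CDATA[
import numpy as np

def video_representation(frame_fisher_vectors):
    """Average-pool per-frame SIFT Fisher vectors (one sampled frame per second)
    into a single 40,960-dimensional video-level vector."""
    fv = np.asarray(frame_fisher_vectors, dtype=np.float32)  # shape: (n_frames, 40960)
    video_vec = fv.mean(axis=0)                               # average pooling over frames
    # Assumed post-processing: signed square-root and L2 normalization.
    video_vec = np.sign(video_vec) * np.sqrt(np.abs(video_vec))
    video_vec /= np.linalg.norm(video_vec) + 1e-12
    return video_vec
]]></preformat>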
    </sec>
    <sec id="sec-4">
      <title>2.2 Motion Feature</title>
      <p>
        We use Improved Trajectories [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to extract dense
trajectories. A combination of Histogram of Oriented
Gradients (HOG), Histogram of Optical Flow (HOF), and
Motion Boundary Histogram (MBH) is used to describe each
trajectory. We encode the HOGHOF and MBH features
separately using the Fisher Vector encoding. The codebook size
is 256, trained using a Gaussian Mixture Model (GMM).
The feature representation of each descriptor after applying
PCA has 65,536 dimensions.
      </p>
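      <p>The 65,536-dimensional figure is consistent with the standard Fisher vector length of 2DK for mean and variance deviations: with K = 256 Gaussians it implies descriptors reduced to D = 128 dimensions by PCA, as the small check below shows.</p>
      <preformat><![CDATA[
def fisher_vector_dim(descriptor_dim, num_gaussians):
    """Fisher vector length when encoding mean and variance deviations."""
    return 2 * descriptor_dim * num_gaussians

# 2 * 128 * 256 = 65,536: a 256-component GMM over 128-dimensional
# (PCA-reduced) HOGHOF or MBH descriptors yields the dimensionality above.
assert fisher_vector_dim(128, 256) == 65536
]]></preformat>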
    </sec>
    <sec id="sec-5">
      <title>2.3 Audio Feature</title>
      <p>We use the popular Mel-frequency Cepstral Coefficients
(MFCC) for extracting audio features. We choose a length
of 25ms for audio segments and a step size of 10ms. The
13-dimensional MFCC vectors along with their first and second
derivatives are used to represent each audio segment.
Raw MFCC features are also encoded using the Fisher vector
encoding. We use a GMM with 256 components to train the codebook.</p>
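      <p>A minimal sketch of this extraction step, assuming librosa as the audio toolkit (the text does not name one) and a 16 kHz sampling rate:</p>
      <preformat><![CDATA[
import librosa
import numpy as np

def mfcc_descriptors(wav_path, sr=16000):
    """39-D MFCC descriptors: 13 coefficients plus first and second derivatives,
    over 25 ms windows with a 10 ms step."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)   # 25 ms analysis window
    hop = int(0.010 * sr)     # 10 ms step size
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)             # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)    # second derivative
    return np.vstack([mfcc, d1, d2]).T           # shape: (n_segments, 39)
]]></preformat>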
    </sec>
    <sec id="sec-6">
      <title>2.4 Deep Learning Feature</title>
      <p>
        We use the popular Caffe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] framework to extract
image features. We use the pre-trained deep model
provided by Simonyan and Zisserman [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This model was
trained on the 1,000 ImageNet concepts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As suggested in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we select the neuron activations of the last three
layers as the feature representation. The third- and
second-to-last layers have 4,096 dimensions each, while the last layer has
1,000 dimensions corresponding to the 1,000 concept
categories in the ImageNet dataset. We denote these features as
VDFC6, VDFC7, and VDFULL in our experiments.
      </p>
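      <p>A sketch of the activation extraction, using torchvision's VGG-16 as a stand-in for the pre-trained Caffe model (an assumption for illustration; only the choice of layers matches the description above):</p>
      <preformat><![CDATA[
import torch
from PIL import Image
from torchvision import models, transforms

# Stand-in for the pre-trained very deep model of Simonyan and Zisserman.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_features(image_path):
    """Return activations analogous to VDFC6 (4,096-D), VDFC7 (4,096-D), VDFULL (1,000-D)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        flat = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
        fc6 = vgg.classifier[0](flat)               # third-to-last layer
        fc7 = vgg.classifier[3](torch.relu(fc6))    # second-to-last layer
        full = vgg.classifier[6](torch.relu(fc7))   # last layer (1,000 concept scores)
    return fc6.squeeze(0), fc7.squeeze(0), full.squeeze(0)
]]></preformat>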
    </sec>
    <sec id="sec-7">
      <title>2.5 Features from Past VSD Tasks</title>
      <p>
        For the violence detection subtask, we also consider using
features from past VSD tasks as external features. In
particular, we use the features that were extracted in the VSD
2014 task for training the violence detector. These features
include SIFT, Dense Trajectories (HOGHOF and MBH
descriptors), and audio MFCC, which achieved the runner-up
performance in VSD 2014 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We denote these features as
HOGHOF, MBH, SIFT, and MFCC in our experiments.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3. CLASSIFICATION</title>
      <p>
        LibSVM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is used for training and testing our affective
impact detectors. For features that are encoded using the
Fisher vector, we use a linear kernel for training and testing.
For the deep learning features, a χ² kernel is used.
      </p>
      <p>We divide the training videos into two subsets. The first
3,072 videos are used for training the model, while the
remaining 3,072 videos are used for validation. To learn the
decision threshold of each detector, we sample this threshold
in the range from 0 to 1 with a step size of 0.01 and select
the value that maximizes the F1 score.</p>
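      <p>A sketch of this threshold search, assuming probability-like detector scores in [0, 1]:</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.metrics import f1_score

def learn_threshold(val_scores, val_labels):
    """Sweep the decision threshold from 0 to 1 in steps of 0.01 and keep the
    value that maximizes F1 on the validation split."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.arange(0.0, 1.01, 0.01):
        preds = (np.asarray(val_scores) >= t).astype(int)
        f1 = f1_score(val_labels, preds, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
]]></preformat>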
      <p>In order to generate the decision for valence or arousal
detection, we need to make the decision from the predictions
of all valence or arousal classes. To this end, we propose
using two strategies: (1) MAX: select the class that has the
highest prediction; (2) MAXREL: select the class that has
the highest relative improvement over the learned threshold.</p>
      <p>[Table 1 and Table 2: submitted runs for each subtask, with validation results (mAP) and official results (accuracy).]</p>
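      <p>A sketch of the two strategies; the exact form of the relative improvement is not spelled out above, and (score - threshold) / threshold is one plausible reading.</p>
      <preformat><![CDATA[
def decide_max(class_scores):
    """MAX: select the class with the highest prediction score.
    class_scores maps each valence/arousal class to its detector score."""
    return max(class_scores, key=class_scores.get)

def decide_maxrel(class_scores, class_thresholds):
    """MAXREL: select the class with the highest improvement relative to its
    learned threshold (assumed here to be (score - threshold) / threshold)."""
    rel = {c: (class_scores[c] - class_thresholds[c]) / max(class_thresholds[c], 1e-12)
           for c in class_scores}
    return max(rel, key=rel.get)
]]></preformat>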
    </sec>
    <sec id="sec-9">
      <title>4. SUBMITTED RUNS</title>
      <p>First, we use late fusion with an average weighting
scheme to combine features from different modalities. After
that, we select the runs that have the top performance on
the validation set to submit. The list of submitted runs for
each subtask and their validation results can be seen in Table
1 and Table 2.</p>
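      <p>A sketch of the late fusion step, assuming each run supplies one score array per modality over the same test videos:</p>
      <preformat><![CDATA[
import numpy as np

def late_fusion_average(per_modality_scores):
    """Late fusion with average weighting: equally weighted mean of the
    scores produced by each modality's classifier for the same videos."""
    return np.mean(np.asarray(per_modality_scores, dtype=float), axis=0)

# Hypothetical usage with score arrays from individual feature detectors:
# fused = late_fusion_average([sift_scores, mfcc_scores, mbh_scores, vdfc7_scores])
]]></preformat>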
    </sec>
    <sec id="sec-10">
      <title>5. RESULTS AND DISCUSSIONS</title>
      <p>The official results for each subtask are shown in the last
column of Table 1 and Table 2. For the violence detection
subtask, we observe that the results of combining multiple
features are more stable. For example, on the validation set,
the run that combines all available features has the lowest
performance; however, on the test set, this run achieves the
best performance. This may be due to the fact that we only
select one split for validation. For both subtasks, combining
with deep learning features can significantly improve the
detection performance. For the induced affect detection subtask,
we found that the strategy using the maximum detection score
tends to have more stable performance. The best valence
detection performance is obtained by combining all internal
features with all deep learning features using the maximum relative
improvement strategy.</p>
    </sec>
    <sec id="sec-11">
      <title>6. ACKNOWLEDGEMENTS</title>
      <p>This research is partially funded by Vietnam National
University Ho Chi Minh City (VNU-HCM) under grant
number B2013-26-01.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <volume>2</volume>
          :27:1–27:27,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>248</fpage>
          –
          <fpage>255</fpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and T. Darrell. Caffe:
          <article-title>Convolutional architecture for fast feature embedding</article-title>
          .
          <source>In Proceedings of the ACM International Conference on Multimedia</source>
          , pages
          <volume>675</volume>
          –
          <fpage>678</fpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>1097</volume>
          –
          <fpage>1105</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satoh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <article-title>NII-UIT at MediaEval 2014 violent scenes detection affect task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Catalunya, Spain, October 16-17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International journal of computer vision</source>
          ,
          <volume>60</volume>
          (
          <issue>2</issue>
          ):
          <volume>91</volume>
          –
          <fpage>110</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mensink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          .
          <article-title>Image classification with the Fisher vector: Theory and practice</article-title>
          .
          <source>International journal of computer vision</source>
          ,
          <volume>105</volume>
          (
          <issue>3</issue>
          ):
          <volume>222</volume>
          –
          <fpage>245</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          , Y. Baveye,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , E. Dellandrea,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Demarty</surname>
            , and
            <given-names>L. Chen.</given-names>
          </string-name>
          <article-title>The MediaEval 2015 affective impact of movies task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          , Wurzen, Germany, September 14-15,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Action recognition with improved trajectories</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          (ICCV), pages
          <fpage>3551</fpage>
          –
          <fpage>3558</fpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>