NII-UIT at MediaEval 2015 Affective Impact of Movies Task

Vu Lam, University of Science, VNU-HCM, lqvu@fit.hcmus.edu.vn
Sang Phan, National Institute of Informatics, Japan, plsang@nii.ac.jp
Duy-Dinh Le, National Institute of Informatics, Japan, ledduy@nii.ac.jp
Shin'ichi Satoh, National Institute of Informatics, Japan, satoh@nii.ac.jp
Duc Anh Duong, University of Information Technology, VNU-HCM, ducda@uit.edu.vn

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
The Affective Impact of Movies task aims to detect violent videos and their affective impact on viewers [9]. This is a challenging task, not only because of the diversity of video content but also because of the subjectiveness of human emotion. In this paper, we present a unified framework that can be applied to both subtasks: (i) induced affect detection and (ii) violence detection. The framework is based on our Violent Scene Detection (VSD) framework from previous years. We extended it to support affect detection by training detectors for the valence/arousal classes independently and combining their predictions to make the final decision. Besides internal features from three modalities (audio, image, and motion), this year we also incorporate deep learning features into our framework. Experimental results show that our unified framework can detect violent videos and their affective impact with reasonable accuracy. Moreover, using deep features significantly improves the detection performance of both subtasks.

1. INTRODUCTION
Detecting the affective impact of movies requires combining multimedia features. For example, a violent car-chase video can be detected by searching for evidence such as fast-moving cars or, possibly, the sound of gunshots. To this end, we have developed a framework that supports combining features from multiple modalities for violent scene detection. We treat induced affect detection as a multi-class classification task, so the same framework can also be applied to predict the valence and arousal class of a video. In general, the framework consists of three main components: feature extraction, feature encoding, and feature classification. An overview of our framework is shown in Fig. 1.

[Figure 1: Our framework for extracting and encoding local features.]

2. FEATURE EXTRACTION

2.1 Image Features
We first rescale each video to 320x240 pixels and sample one frame per second. We use the standard SIFT feature with the Hessian-Laplace interest point detector to extract descriptors from each frame [6]. Each frame is represented using Fisher Vector encoding [7]. We use an average pooling strategy to aggregate the frame-based features into the final video representation, which has 40,960 dimensions.

2.2 Motion Feature
We use Improved Trajectories [10] to extract dense trajectories. A combination of Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF), and Motion Boundary Histograms (MBH) is used to describe each trajectory. We encode the HOGHOF and MBH features separately using Fisher Vector encoding. The codebook size is 256, trained with a Gaussian Mixture Model (GMM). After applying PCA, the feature representation of each descriptor has 65,536 dimensions.

2.3 Audio Feature
We use the popular Mel-frequency cepstral coefficients (MFCC) to extract audio features, with a segment length of 25 ms and a step size of 10 ms. Each audio segment is represented by the 13-dimensional MFCC vector together with its first and second derivatives. The raw MFCC features are also encoded with Fisher Vector encoding, using a GMM-trained codebook of 256 clusters. For the audio features we do not apply PCA; the final feature descriptor has 19,968 dimensions.
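The Fisher Vector pipeline shared by the image, motion, and audio features can be illustrated with the short sketch below. It is a minimal sketch, not our actual implementation: the descriptor arrays are random placeholders standing in for the SIFT/HOGHOF/MBH/MFCC descriptors produced by external tools, and the PCA dimensionality of 80 is only an assumption, chosen so that 2 x 256 x 80 = 40,960 matches the image representation size reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Improved Fisher Vector (mean and variance gradients) for one frame."""
    N, _ = descriptors.shape
    gamma = gmm.predict_proba(descriptors)                     # (N, K) soft assignments
    mu, w = gmm.means_, gmm.weights_                           # (K, D), (K,)
    sigma = np.sqrt(gmm.covariances_)                          # (K, D) diagonal std devs
    diff = (descriptors[:, None, :] - mu[None]) / sigma[None]  # (N, K, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_sig = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])         # 2*K*D dimensions
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                   # L2 normalization

# Hypothetical pool of local descriptors from training frames (e.g. raw 128-D SIFT).
train_descs = np.random.randn(20000, 128)
pca = PCA(n_components=80).fit(train_descs)                    # assumed: 2*256*80 = 40,960-D FV
gmm = GaussianMixture(n_components=256, covariance_type="diag", max_iter=50)
gmm.fit(pca.transform(train_descs))

# One video: encode each sampled frame, then average-pool the frame-level FVs.
frames = [np.random.randn(500, 128) for _ in range(30)]        # descriptors per frame (placeholder)
video_repr = np.mean([fisher_vector(pca.transform(f), gmm) for f in frames], axis=0)
print(video_repr.shape)                                        # (40960,)
```

The same encode-then-pool pattern applies to the other modalities; only the descriptor type and the PCA setting change (the audio features skip PCA), which is why the final dimensionalities differ per modality.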
2.4 Deep Learning Feature
We use the popular Caffe framework [3] to extract image features, with the pre-trained deep model provided by Simonyan and Zisserman [8]. This model was trained on the 1,000 ImageNet concepts [2]. As suggested in [4], we select the neuron activations of the last three layers as the feature representation. The third- and second-to-last layers have 4,096 dimensions each, while the last layer has 1,000 dimensions, corresponding to the 1,000 concept categories of the ImageNet dataset. We denote these features as VDFC6, VDFC7, and VDFULL in our experiments.

2.5 Features from Past VSD Tasks
For the violence detection subtask, we also consider features from past VSD tasks as external features. In particular, we reuse the features extracted for the VSD 2014 task to train the violence detector. These features include SIFT, Dense Trajectories (HOGHOF and MBH descriptors), and audio MFCC, which achieved the runner-up performance in VSD 2014 [5]. We denote these features as FOHGHOF, HBM, TFIS, and CCFM in our experiments.

3. CLASSIFICATION
LibSVM [1] is used for training and testing our affective impact detectors. For features encoded with the Fisher Vector we use a linear kernel for training and testing; for the deep learning features a χ2 kernel is used.

We divide the training videos into two subsets: the first 3,072 videos are used to train the models, while the remaining 3,072 videos are used for validation. To learn the decision threshold of each detector, we sample the threshold in the range from 0 to 1 with a step size of 0.01 and select the value that maximizes the F1 score.

To generate the decision for valence or arousal detection, we need to combine the predictions of all valence or arousal classes. To this end, we propose two strategies: (1) MAX, which selects the class with the highest prediction score; and (2) MAXREL, which selects the class with the highest relative improvement over its learned threshold.

4. SUBMITTED RUNS
We first use late fusion with an average weighting scheme to combine features from different modalities. We then submit the runs with the top performance on the validation set. The submitted runs for each subtask and their validation results are listed in Table 1 and Table 2.

Table 1: Submitted violence detection runs and official results.
Run     Features                                                    Validation (mAP)   Official (mAP)
1       HOGHOF+MBH+MFCC                                             0.2200             0.2039
2       HOGHOF+MBH+SIFT+MFCC                                        0.2094             0.2087
3 ext   HOGHOF+MBH+MFCC+VDFULL                                      0.2457             0.2380
4 ext   HOGHOF+MBH+MFCC+VDFULL+HBM                                  0.2499             0.2196
5 ext   HOGHOF+MBH+MFCC+VDFULL+VDFC6+VDFC7+FOHGHOF+HBM+TFIS+CCFM    0.1930             0.2684

Table 2: Submitted induced affect detection runs and official results.
                                                      Decision    Validation (mAP)     Official (accuracy, %)
Run     Features                                      Strategy    Valence   Arousal    Valence   Arousal
1       HOGHOF+MBH+SIFT+MFCC                          MAXREL      0.4148    0.3998     39.823    35.723
2       HOGHOF+MBH+SIFT+MFCC                          MAX         0.4148    0.3998     41.653    55.908
3 ext   HOGHOF+MBH+SIFT+MFCC+VDFULL+VDFC6+VDFC7       MAXREL      0.4376    0.3958     42.956    55.677
4 ext   HOGHOF+MBH+SIFT+MFCC+VDFULL+VDFC6+VDFC7       MAX         0.4376    0.3958     42.914    55.656
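The classification-stage logic of Sections 3 and 4 can be summarized with the sketch below. It is a sketch under our own assumptions rather than the exact system: the scores and thresholds are hypothetical numbers, late fusion is shown as a plain average of modality-level scores, and MAXREL is interpreted here as the largest improvement over the learned threshold relative to that threshold.

```python
import numpy as np
from sklearn.metrics import f1_score

def learn_threshold(scores, labels, step=0.01):
    """Pick the decision threshold in [0, 1] that maximizes F1 on the validation split."""
    thresholds = np.arange(0.0, 1.0 + step, step)
    f1s = [f1_score(labels, (scores >= t).astype(int), zero_division=0) for t in thresholds]
    return thresholds[int(np.argmax(f1s))]

def decide(class_scores, class_thresholds, strategy="MAX"):
    """Combine per-class detector scores into a single valence/arousal label."""
    scores = np.asarray(class_scores, dtype=float)
    if strategy == "MAX":                                      # highest raw prediction wins
        return int(np.argmax(scores))
    # MAXREL (one interpretation): largest gain over the learned threshold,
    # measured relative to that threshold.
    rel = (scores - class_thresholds) / (class_thresholds + 1e-12)
    return int(np.argmax(rel))

# Late fusion with average weighting over modality-level scores (hypothetical values).
fused_score = np.mean([0.62, 0.48, 0.71])                      # e.g. motion, audio, deep

# Learning a threshold for one detector from hypothetical validation scores and labels.
val_scores = np.random.rand(200)
val_labels = (val_scores + 0.1 * np.random.randn(200) > 0.5).astype(int)
t_star = learn_threshold(val_scores, val_labels)

# Hypothetical 3-class example with per-class thresholds learned on the validation split.
thresholds = np.array([0.40, 0.55, 0.30])
print(decide([0.45, 0.60, 0.35], thresholds, "MAX"))           # -> class 1
print(decide([0.45, 0.60, 0.35], thresholds, "MAXREL"))        # -> class 2
```

With these example numbers the two strategies pick different classes, which mirrors the fact that MAX and MAXREL runs obtain different official accuracies in Table 2.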
5. RESULTS AND DISCUSSION
The official results of each subtask are shown in the last column of Table 1 and Table 2.

For the violence detection subtask, we observe that combining multiple features yields more stable results. For example, on the validation set the run that combines all available features has the lowest performance, yet on the test set it achieves the best performance. This may be because we used only a single split for validation. For both subtasks, adding the deep learning features significantly improves detection performance.

For the induced affect detection subtask, we found that the strategy using the maximum detection score (MAX) tends to give more stable performance. The best valence detection performance is obtained by combining all internal features with all deep learning features using the maximum relative improvement (MAXREL) strategy.

6. ACKNOWLEDGEMENTS
This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

7. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248-255. IEEE, 2009.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM, 2014.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[5] V. Lam, D. Le, S. Phan, S. Satoh, and D. A. Duong. NII-UIT at MediaEval 2014 Violent Scenes Detection Affect Task. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Catalunya, Spain, October 16-17, 2014.
[6] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[7] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.
[8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[9] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandrea, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies task. In MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.
[10] H. Wang and C. Schmid. Action recognition with improved trajectories. In International Conference on Computer Vision (ICCV), pages 3551-3558. IEEE, 2013.