Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning

Qi Dai1, Rui-Wei Zhao1, Zuxuan Wu1, Xi Wang1, Zichen Gu2, Wenhai Wu2, Yu-Gang Jiang1*
1 School of Computer Science, Fudan University, Shanghai, China
2 Media Lab, Huawei Technologies Co. Ltd., China
* Corresponding author. Email: ygj@fudan.edu.cn.

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sep. 14-15, 2015, Wurzen, Germany.

ABSTRACT
Techniques for violent scene detection and affective impact prediction in videos can be deployed in many applications. In MediaEval 2015, we explore deep learning methods to tackle this challenging problem. Our system consists of several deep learning features. First, we train a Convolutional Neural Network (CNN) model with a subset of ImageNet classes selected particularly for violence detection. Second, we adopt a specially designed two-stream CNN framework [1] to extract features from both static frames and motion optical flows. Third, Long Short Term Memory (LSTM) models are applied on top of the two-stream CNN features to capture longer-term temporal dynamics. In addition, several conventional motion and audio features are extracted as complementary information to the deep learning features. By fusing all of these features, we achieve a mean average precision of 0.296 in the violence detection subtask, and accuracies of 0.418 and 0.488 for arousal and valence, respectively, in the induced affect detection subtask.

1. SYSTEM DESCRIPTION
Figure 1 gives an overview of our system. In this short paper, we briefly describe each of the key components. For more information about the task definitions, interested readers may refer to [2].

Figure 1: The key components in our system. (Video clips go through feature extraction, producing CNN-Spatial, CNN-Temporal, LSTM, CNN-Violence, FV-TrajShape, FV-HOG, FV-HOF, FV-MBH, STIP and MFCC features, which are then fed to SVM classifiers.)

1.1 Features
We extract several features, including both neural-network-based features and conventional hand-crafted ones, as described in the following.

CNN-Violence: The effectiveness of CNN models has been verified on many visual recognition tasks such as object recognition. We train an AlexNet-based model [3] on video frames, which takes individual frames as network inputs followed by several convolutional layers, pooling layers and fully connected (FC) layers. Specifically, a subset of ImageNet is used to tune the network. We manually pick 2614 classes that are relatively more related to violence (or its related semantic ingredients). These classes are mostly among the categories of scenes, people, weapons and actions. The outputs of FC6 (i.e., the sixth FC layer; 4096-d), FC7 (4096-d) and FC8 (2614-d) are used as the features.

Two-stream CNN: Recent works [1, 4] have also revealed the effectiveness of CNN models for video classification. Video data can be naturally decomposed into two components, namely spatial and temporal. Thus we adopt a two-stream (spatial stream and temporal stream) CNN model to extract features. Specifically, for the spatial stream, a CNN model pre-trained on the ImageNet Challenge dataset (different from the 2614 classes used in CNN-Violence) is used, and the outputs of its last three FC layers are taken as features. For the temporal stream, which aims to capture the motion information, a CNN model is trained to take stacked optical flows as input, and the output of its last FC layer is used as the feature. For more details of the two-stream CNN model used in this evaluation, please refer to [4]. Note that these models are not fine-tuned on the MediaEval data.
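As an illustrative sketch (not the exact pipeline used in our runs), the snippet below shows how such per-frame FC activations could be extracted with forward hooks, assuming PyTorch/torchvision; the standard 1000-class ImageNet AlexNet stands in for the violence-tuned 2614-class model described above, and the frame path and clip-level pooling are placeholders.

```python
# Sketch: per-frame FC6/FC7/FC8 feature extraction from a pretrained AlexNet.
# The standard ImageNet weights are used purely for illustration; the
# violence-tuned 2614-class model from the paper is not reproduced here.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

# In torchvision's AlexNet, model.classifier holds the three FC layers at
# indices 1 (fc6), 4 (fc7) and 6 (fc8); ReLU/Dropout layers sit in between.
fc_outputs = {}

def save_output(name):
    def hook(module, inputs, output):
        fc_outputs[name] = output.detach()
    return hook

model.classifier[1].register_forward_hook(save_output("fc6"))  # 4096-d
model.classifier[4].register_forward_hook(save_output("fc7"))  # 4096-d
model.classifier[6].register_forward_hook(save_output("fc8"))  # 1000-d here; 2614-d in the tuned model

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_frame_features(frame_path):
    """Return the FC activations for a single video frame."""
    image = Image.open(frame_path).convert("RGB")
    with torch.no_grad():
        model(preprocess(image).unsqueeze(0))
    return {name: feat.squeeze(0) for name, feat in fc_outputs.items()}

# A clip-level descriptor could then be obtained by averaging the per-frame
# vectors over the frames sampled from each video clip.
```

The temporal-stream features would be obtained analogously from a flow-trained network that takes stacked optical flow fields as input.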
LSTM: In order to further model the long-term dynamic information that is mostly discarded by the spatial and temporal stream CNNs, we utilize our recently developed LSTM model [5]. Different from a traditional Recurrent Neural Network (RNN) unit, the LSTM unit has a built-in memory cell, and several non-linear gates govern the information flow into and out of the cell, which enables the model to explore long-range dynamics. Figure 2 shows the structure of the LSTM model. With the two-stream CNN model, video frames or stacked optical flows can be transformed into a series of fixed-length vector representations, and the LSTM model is used to capture this temporal information. Due to the time constraints of the evaluation, we directly adopt an LSTM model trained on another video dataset (UCF-101 [6]) and use the average output over all time-steps of the last LSTM layer as the feature (512-d).

Figure 2: The structure of the LSTM network.

Conventional features: Same as last year [7], we also extract the improved dense trajectories (IDT) features according to [8]. Four trajectory-based descriptors are computed: histograms of oriented gradients (HOG), histograms of optical flow (HOF), motion boundary histograms (MBH) and trajectory shape (TrajShape). These features are encoded using Fisher vectors (FV) with a codebook of 256 codewords. The other two kinds of conventional features are Space-Time Interest Points (STIP) [9] and Mel-Frequency Cepstral Coefficients (MFCC). STIP describes the texture and motion around local interest points and is encoded using the bag-of-words framework with 4000 codewords. MFCC is a very popular audio feature, extracted from every 32ms time window with 50% overlap; bag-of-words quantization with 4000 codewords is also adopted for the MFCC descriptors.

1.2 Classification
We use SVM as the classifier. A linear kernel is used for the four IDT features, and a χ² kernel is used for all the others. For feature fusion, kernel-level fusion is adopted, which linearly combines the kernels computed on the different features. Note that direct classification with the CNNs is also feasible and may lead to better results; however, tuning the models on the MediaEval data requires additional computation.
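As an illustrative sketch of this classification stage (assuming scikit-learn; the toy feature matrices, equal fusion weights and the SVM cost below are placeholders rather than the settings used in our submissions), a linear kernel and a χ² kernel can be linearly combined and passed to an SVM as a precomputed kernel:

```python
# Sketch: kernel-level fusion of a linear kernel (for Fisher-vector IDT
# features) and a chi-square kernel (for bag-of-words features), fed to an
# SVM with a precomputed kernel. Data and weights are illustrative only.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel, linear_kernel
from sklearn.svm import SVC

def fused_kernel(features_a, features_b, kernel_fns, weights=None):
    """Linearly combine kernels computed on different feature types."""
    if weights is None:
        weights = [1.0 / len(kernel_fns)] * len(kernel_fns)
    K = np.zeros((features_a[0].shape[0], features_b[0].shape[0]))
    for Xa, Xb, kernel_fn, w in zip(features_a, features_b, kernel_fns, weights):
        K += w * kernel_fn(Xa, Xb)
    return K

# Hypothetical toy data: 100 training / 20 test clips, two feature types.
rng = np.random.default_rng(0)
train_fv  = rng.normal(size=(100, 512))    # Fisher-vector-like feature -> linear kernel
train_bow = rng.random(size=(100, 4000))   # non-negative bag-of-words -> chi-square kernel
test_fv   = rng.normal(size=(20, 512))
test_bow  = rng.random(size=(20, 4000))
y_train   = rng.integers(0, 2, size=100)   # e.g. violent vs. non-violent labels

kernels = [linear_kernel, chi2_kernel]
K_train = fused_kernel([train_fv, train_bow], [train_fv, train_bow], kernels)
K_test  = fused_kernel([test_fv, test_bow], [train_fv, train_bow], kernels)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
scores = clf.decision_function(K_test)     # ranking scores, e.g. for MAP evaluation
```

In practice, the per-feature fusion weights could also be tuned by cross-validation on the development set rather than fixed to be equal.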
2. SUBMITTED RUNS AND RESULTS
There are two subtasks in this year's evaluation, namely violence detection and induced affect detection. Induced affect detection requires participants to predict two emotional impacts, arousal and valence, of a video clip. We submitted five runs for each subtask. For both subtasks, Run 1 uses the conventional features, Run 2 uses all the deep learning features, Run 3 combines Run 1 and the CNN-Violence feature, Run 4 further includes the two-stream CNN features, and, finally, Run 5 fuses all the features.

Figure 3: Performance of our 5 submitted runs on both the affect and violence subtasks. For the affect subtask, accuracy is used as the official performance measure; for the violence subtask, MAP is used.

Figure 3 shows the results of all the submissions. The official performance measures are accuracy and MAP for the affect and violence subtasks, respectively. We can see that the deep learning based features (Run 2) are significantly better than the conventional features (Run 1) for the violence subtask, and the two are comparable for the affect subtask. This is possibly because the CNN-Violence feature is specially optimized for detecting violence. Comparing Run 3 with Run 1, it is clear that the CNN-Violence feature improves the result by a large margin for the violence subtask (from 0.165 to 0.270), but the gain is much less significant for the other subtask. In addition, the two-stream CNN brings considerable improvement on both subtasks (Run 4). The LSTM models seem to be ineffective (Run 5 vs. Run 4); the reason is that the LSTM models were trained on the UCF-101 dataset, which is very different from the data used in MediaEval. We expect clear improvements from LSTM if the models can be re-trained. Also, the contributions from the CNN-based models could probably be even more significant if they were re-trained on the MediaEval data. Overall, we conclude that deep learning features are very effective for this task and that there is substantial room for improvement with model re-training.

Acknowledgements
This work was supported in part by a Key Technologies Research and Development Program of China (#2013BAH09F01), a National 863 Program of China (#2014AA015101), a grant from NSFC (#61201387), and Huawei Technologies.

3. REFERENCES
[1] K. Simonyan and A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In NIPS, 2014.
[2] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, 2015.
[3] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[4] H. Ye, Z. Wu, R.-W. Zhao, X. Wang, Y.-G. Jiang, X. Xue. Evaluating Two-Stream CNN for Video Classification. In ICMR, 2015.
[5] Z. Wu, X. Wang, Y.-G. Jiang et al. Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification. In ACM Multimedia, 2015.
[6] K. Soomro, A. R. Zamir and M. Shah. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. CRCV-TR-12-01, 2012.
[7] Q. Dai, Z. Wu, Y.-G. Jiang et al. Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks. In MediaEval 2014 Workshop, 2014.
[8] H. Wang, C. Schmid. Action Recognition with Improved Trajectories. In ICCV, 2013.
[9] I. Laptev. On Space-Time Interest Points. IJCV, 64:107-123, 2005.