Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning

Qi Dai1, Rui-Wei Zhao1, Zuxuan Wu1, Xi Wang1, Zichen Gu2, Wenhai Wu2, Yu-Gang Jiang1*
1 School of Computer Science, Fudan University, Shanghai, China
2 Media Lab, Huawei Technologies Co. Ltd., China
* Corresponding author. Email: ygj@fudan.edu.cn.

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sep. 14-15, 2015, Wurzen, Germany.

ABSTRACT
Techniques for violent scene detection and affective impact prediction in videos can be deployed in many applications. In MediaEval 2015, we explore deep learning methods to tackle this challenging problem. Our system consists of several deep learning features. First, we train a Convolutional Neural Network (CNN) model with a subset of ImageNet classes selected particularly for violence detection. Second, we adopt a specially designed two-stream CNN framework [1] to extract features from both static frames and motion optical flows. Third, Long Short Term Memory (LSTM) models are applied on top of the two-stream CNN features to capture longer-term temporal dynamics. In addition, several conventional motion and audio features are extracted as complementary information to the deep learning features. By fusing all of these features, we achieve a mean average precision of 0.296 in the violence detection subtask, and accuracies of 0.418 and 0.488 for arousal and valence, respectively, in the induced affect detection subtask.

1. SYSTEM DESCRIPTION
Figure 1 gives an overview of our system. In this short paper, we briefly describe each of the key components. For more information about the task definitions, interested readers may refer to [2].

Figure 1: The key components in our system. (Video clips go through feature extraction, producing CNN-Spatial, CNN-Temporal, LSTM, CNN-Violence, FV-TrajShape, FV-HOG, FV-HOF, FV-MBH, STIP and MFCC features, which are then fed to SVM classifiers.)

1.1 Features
We extract several features, including both neural-network-based features and conventional hand-crafted ones, as described in the following.

CNN-Violence: The effectiveness of CNN models has been verified on many visual recognition tasks such as object recognition. We train an AlexNet-based model [3] on video frames, which takes individual frames as network inputs followed by several convolutional layers, pooling layers and fully connected (FC) layers. Specifically, a subset of ImageNet is used to tune the network. We manually pick 2614 classes that are relatively more related to violence (or its related semantic ingredients). These classes are mostly among the categories of scenes, people, weapons and actions. The outputs of FC6 (i.e., the sixth FC layer; 4096-d), FC7 (4096-d) and FC8 (2614-d) are used as the features.

Two-stream CNN: Recent works [1, 4] have also revealed the effectiveness of CNN models for video classification. Video data can be naturally decomposed into two components, namely spatial and temporal. Thus we adopt a two-stream (spatial stream and temporal stream) CNN model to extract features. Specifically, for the spatial stream, a CNN model pre-trained on the ImageNet Challenge dataset (different from the 2614 classes used in CNN-Violence) is used, and the outputs of its last three FC layers are taken as features. For the temporal stream, which aims to capture the motion information, a CNN model is trained to take stacked optical flows as input, and the output of its last FC layer is used as the feature. For more details of the two-stream CNN model used in this evaluation, please refer to [4]. Note that these models are not fine-tuned on the MediaEval data.
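As an illustrative sketch (not the exact pipeline used in our runs), the snippet below shows how such per-frame FC activations could be extracted with forward hooks, assuming PyTorch/torchvision; the standard 1000-class ImageNet AlexNet stands in for the violence-tuned 2614-class model described above, and the frame path and clip-level pooling are placeholders.

```python
# Sketch: per-frame FC6/FC7/FC8 feature extraction from a pretrained AlexNet.
# The standard ImageNet weights are used purely for illustration; the
# violence-tuned 2614-class model from the paper is not reproduced here.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

# In torchvision's AlexNet, model.classifier holds the three FC layers at
# indices 1 (fc6), 4 (fc7) and 6 (fc8); ReLU/Dropout layers sit in between.
fc_outputs = {}

def save_output(name):
    def hook(module, inputs, output):
        fc_outputs[name] = output.detach()
    return hook

model.classifier[1].register_forward_hook(save_output("fc6"))  # 4096-d
model.classifier[4].register_forward_hook(save_output("fc7"))  # 4096-d
model.classifier[6].register_forward_hook(save_output("fc8"))  # 1000-d here; 2614-d in the tuned model

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_frame_features(frame_path):
    """Return the FC activations for a single video frame."""
    image = Image.open(frame_path).convert("RGB")
    with torch.no_grad():
        model(preprocess(image).unsqueeze(0))
    return {name: feat.squeeze(0) for name, feat in fc_outputs.items()}

# A clip-level descriptor could then be obtained by averaging the per-frame
# vectors over the frames sampled from each video clip.
```

The temporal-stream features would be obtained analogously from a flow-trained network that takes stacked optical flow fields as input.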
LSTM: In order to further model the long-term dynamic information that is mostly discarded by the spatial and temporal stream CNNs, we utilize our recently developed LSTM model [5]. Different from a traditional Recurrent Neural Network (RNN) unit, the LSTM unit has a built-in memory cell, and several non-linear gates govern the information flow into and out of the cell, which enables the model to explore long-range dynamics. Figure 2 shows the structure of the LSTM model. With the two-stream CNN model, video frames or stacked optical flows can be transformed into a series of fixed-length vector representations, and the LSTM model is used to capture this temporal information. Due to the time constraints of the evaluation, we directly adopt an LSTM model trained on another video dataset (UCF-101 [6]) and use the average output over all time-steps of the last LSTM layer as the feature (512-d).

Figure 2: The structure of the LSTM network.

Conventional features: Same as last year [7], we also extract the improved dense trajectories (IDT) features according to [8]. Four trajectory-based descriptors are computed: histograms of oriented gradients (HOG), histograms of optical flow (HOF), motion boundary histograms (MBH) and trajectory shape (TrajShape). These features are encoded using Fisher vectors (FV) with a codebook of 256 codewords. The other two kinds of conventional features are Space-Time Interest Points (STIP) [9] and Mel-Frequency Cepstral Coefficients (MFCC). STIP describes the texture and motion around local interest points and is encoded using the bag-of-words framework with 4000 codewords. MFCC is a very popular audio feature, extracted from every 32ms time window with 50% overlap; bag-of-words quantization with 4000 codewords is also adopted for the MFCC descriptors.

1.2 Classification
We use SVM as the classifier. A linear kernel is used for the four IDT features, and a χ² kernel is used for all the others. For feature fusion, kernel-level fusion is adopted, which linearly combines the kernels computed on the different features. Note that direct classification with the CNNs is also feasible and may lead to better results; however, tuning the models on the MediaEval data requires additional computation.
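As an illustrative sketch of this classification stage (assuming scikit-learn; the toy feature matrices, equal fusion weights and the SVM cost below are placeholders rather than the settings used in our submissions), a linear kernel and a χ² kernel can be linearly combined and passed to an SVM as a precomputed kernel:

```python
# Sketch: kernel-level fusion of a linear kernel (for Fisher-vector IDT
# features) and a chi-square kernel (for bag-of-words features), fed to an
# SVM with a precomputed kernel. Data and weights are illustrative only.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel, linear_kernel
from sklearn.svm import SVC

def fused_kernel(features_a, features_b, kernel_fns, weights=None):
    """Linearly combine kernels computed on different feature types."""
    if weights is None:
        weights = [1.0 / len(kernel_fns)] * len(kernel_fns)
    K = np.zeros((features_a[0].shape[0], features_b[0].shape[0]))
    for Xa, Xb, kernel_fn, w in zip(features_a, features_b, kernel_fns, weights):
        K += w * kernel_fn(Xa, Xb)
    return K

# Hypothetical toy data: 100 training / 20 test clips, two feature types.
rng = np.random.default_rng(0)
train_fv  = rng.normal(size=(100, 512))    # Fisher-vector-like feature -> linear kernel
train_bow = rng.random(size=(100, 4000))   # non-negative bag-of-words -> chi-square kernel
test_fv   = rng.normal(size=(20, 512))
test_bow  = rng.random(size=(20, 4000))
y_train   = rng.integers(0, 2, size=100)   # e.g. violent vs. non-violent labels

kernels = [linear_kernel, chi2_kernel]
K_train = fused_kernel([train_fv, train_bow], [train_fv, train_bow], kernels)
K_test  = fused_kernel([test_fv, test_bow], [train_fv, train_bow], kernels)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
scores = clf.decision_function(K_test)     # ranking scores, e.g. for MAP evaluation
```

In practice, the per-feature fusion weights could also be tuned by cross-validation on the development set rather than fixed to be equal.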
2. SUBMITTED RUNS AND RESULTS
There are two subtasks in this year's evaluation, namely violence detection and induced affect detection. Induced affect detection requires participants to predict two emotional impacts, arousal and valence, of a video clip. We submitted five runs for each subtask. For both subtasks, Run 1 uses the conventional features, Run 2 uses all the deep learning features, Run 3 combines Run 1 and the CNN-Violence feature, Run 4 further includes the two-stream CNN features, and, finally, Run 5 fuses all the features.

Figure 3: Performance of our 5 submitted runs on both the affect and violence subtasks. For the affect subtask, accuracy is used as the official performance measure; for the violence subtask, MAP is used.

Figure 3 shows the results of all the submissions. The official performance measures are accuracy and MAP for the affect and violence subtasks, respectively. We can see that the deep learning based features (Run 2) are significantly better than the conventional features (Run 1) for the violence subtask, and the two are comparable for the affect subtask. This is possibly because the CNN-Violence feature is specially optimized for detecting violence. Comparing Run 3 with Run 1, it is clear that the CNN-Violence feature improves the result by a large margin for the violence subtask (from 0.165 to 0.270), but the gain is much less significant for the other subtask. In addition, the two-stream CNN brings considerable improvement on both subtasks (Run 4). The LSTM models seem to be ineffective (Run 5 vs. Run 4); the reason is that the LSTM models were trained on the UCF-101 dataset, which is very different from the data used in MediaEval. We expect clear improvements from LSTM if the models can be re-trained. Also, the contributions from the CNN-based models could probably be even more significant if they were re-trained on the MediaEval data. Overall, we conclude that deep learning features are very effective for this task and that there is substantial room for improvement with model re-training.

Acknowledgements
This work was supported in part by a Key Technologies Research and Development Program of China (#2013BAH09F01), a National 863 Program of China (#2014AA015101), a grant from NSFC (#61201387), and Huawei Technologies.

3. REFERENCES
[1] K. Simonyan and A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In NIPS, 2014.
[2] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, 2015.
[3] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[4] H. Ye, Z. Wu, R.-W. Zhao, X. Wang, Y.-G. Jiang, X. Xue. Evaluating Two-Stream CNN for Video Classification. In ICMR, 2015.
[5] Z. Wu, X. Wang, Y.-G. Jiang et al. Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification. In ACM Multimedia, 2015.
[6] K. Soomro, A. R. Zamir and M. Shah. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. CRCV-TR-12-01, 2012.
[7] Q. Dai, Z. Wu, Y.-G. Jiang et al. Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks. In MediaEval 2014 Workshop, 2014.
[8] H. Wang, C. Schmid. Action Recognition with Improved Trajectories. In ICCV, 2013.
[9] I. Laptev. On Space-Time Interest Points. IJCV, 64:107-123, 2005.