Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks

Qi Dai§, Zuxuan Wu§, Yu-Gang Jiang§, Xiangyang Xue§, Jinhui Tang‡
§ School of Computer Science, Fudan University, Shanghai
‡ School of Computer Science and Engineering, Nanjing University of Science and Technology
{daiqi,zxwu,ygj,xyxue}@fudan.edu.cn, jinhuitang@mail.njust.edu.cn

ABSTRACT
The Violent Scenes Detection task aims at evaluating algorithms that
automatically localize violent segments in both Hollywood movies and short
web videos. The definition of violence is subjective: “the segments that
one would not let an 8 years old child see in a movie because they contain
physical violence”. This is a highly challenging problem because of the
strong content variations among the positive instances. In this year’s
evaluation, we adopted our recently proposed classification method, named
regularized DNN, to fuse multiple features using Deep Neural Networks
(DNN). We extracted a set of visual and audio features that have been
observed to be useful, and then applied the regularized DNN for feature
fusion and classification. Results indicate that using multiple features is
still very helpful and, more importantly, that our proposed regularized DNN
offers significantly better results than the popular SVM. We achieved a
mean average precision of 0.63 for the main task and 0.60 for the
generalization task.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

1. SYSTEM DESCRIPTION
  Figure 1 gives an overview of our system. In this short paper, we briefly
describe each of the key components. For the task definition, data and
evaluation metric, interested readers may refer to [1].

Figure 1: An overview of the key components in our system, where circled
numbers indicate the 5 submitted runs.
[Diagram labels: Video Clips; Feature Extraction (FV-HOG, FV-HOF, FV-MBH,
FV-TrajShape, TrajMF-HOG, TrajMF-HOF, TrajMF-MBH, STIP, MFCC); SVM; DNN;
Fusion; Smoothing & Merging.]

1.1 Features
  Three kinds of audio-visual features were extracted, all of which were
observed to be useful in 2013.
  We extracted trajectory-based motion features according to our previous
work [2]. A main difference is that the new improved dense trajectories
(IDT) [4] were used as the basis to replace the original dense
trajectories. Four baseline features, histograms of oriented gradients
(HOG), histograms of optical flow (HOF), motion boundary histograms (MBH)
and trajectory shape (TrajShape) descriptors, were computed. These features
were encoded using Fisher vectors (FV) with a codebook of 256 codewords. We
further computed our proposed TrajMF [2] based on the HOG, HOF and MBH, by
considering the motion relationships of the trajectories. As the dimension
of the original TrajMF is very high, we employed expectation-maximization
principal component analysis (EM-PCA) [3] for dimension reduction,
generating a 1500-dimensional representation for each feature. In total,
there are seven trajectory-based features, including four baseline FV and
three dimension-reduced TrajMF features. See [2] for more details.
  The other two kinds of features are Space-Time Interest Points (STIP) [5]
and Mel-Frequency Cepstral Coefficients (MFCC). STIP describes the texture
and motion around local interest points, and was encoded using the
bag-of-words framework with 4000 codewords. Here we randomly sampled 300k
features and used k-means to generate the codebook. MFCC is a very popular
audio feature. It was extracted from every 32 ms time window with 50%
overlap. The bag-of-words framework was also adopted to quantize the MFCC
descriptors, using 4000 codewords.

1.2 Classifiers
  We adopted both SVM and deep neural networks (DNN) for classification.
  SVM: The χ2 kernel was adopted for the bag-of-words features (STIP and
MFCC), and the linear kernel was used for the others. For feature fusion,
kernel-level average fusion was used for the trajectory-based features,
while score-level average late fusion was adopted to combine the trajectory
features with STIP and MFCC.
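
  The two SVM fusion levels described above can be illustrated with a
minimal sketch. This is not the authors' code. The exponential form of the
χ2 kernel, the gamma parameter, and all function names are assumptions made
for illustration, since the paper only names the kernel types and the two
averaging steps.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0, eps=1e-10):
    """Exponential chi-square kernel between rows of X and rows of Y.

    Assumes non-negative histogram features (e.g. bag-of-words).
    The exponential form and gamma are assumptions, not from the paper.
    """
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :]
    dist = np.sum(diff ** 2 / (summ + eps), axis=2)
    return np.exp(-gamma * dist)

def linear_kernel(X, Y):
    """Plain inner-product kernel, as used for the non-histogram features."""
    return X @ Y.T

def average_kernels(kernels):
    """Kernel-level average fusion: element-wise mean of kernel matrices."""
    return np.mean(np.stack(kernels, axis=0), axis=0)

def late_fusion(score_lists):
    """Score-level average late fusion: mean of per-classifier scores."""
    return np.mean(np.stack(score_lists, axis=0), axis=0)

# Toy data: two hypothetical trajectory features and one histogram feature.
rng = np.random.default_rng(0)
traj_a = rng.normal(size=(4, 8))
traj_b = rng.normal(size=(4, 6))
hist = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.2, 0.7]])

# Kernel-level fusion: one averaged kernel for the trajectory features.
K_traj = average_kernels([linear_kernel(traj_a, traj_a),
                          linear_kernel(traj_b, traj_b)])
K_hist = chi2_kernel(hist, hist)

# Score-level fusion: average the outputs of separately trained classifiers.
fused = late_fusion([np.array([0.2, 0.4]), np.array([0.6, 0.8])])
```

  In this sketch, the averaged trajectory kernel would be passed to a single
SVM trained with a precomputed kernel, whereas the late-fusion step operates
only on the output scores of classifiers trained separately.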

Figure 3: Performance of our 5 submitted runs on both main and
generalization tasks. Note that, following this year’s guideline, a
specially designed MAP was used (MAP2014 [1]).

  MAP      Main Task   Generalization Task
  Run 1    0.409       0.494
  Run 2    0.454       0.604
  Run 3    0.404       0.538
  Run 4    0.63        0.5
  Run 5    0.514       0.552

[Figure 2 diagram labels: inputs x_{n,1}, x_{n,2}, x_{n,3} at layer l = 1;
per-feature Feature Extraction layers up to l = E; Feature Fusion layer at
l = F; Classification layer at l = L.]
Figure 2: Illustration of the structure of our regularized DNN. Multiple
features are used as the inputs, and the network transforms the features
separately first, before using regularizations to explore feature
relationships. The identified relationships are then utilized for improved
classification performance. This figure is reprinted from [7].

  DNN: We also adopted a new DNN-based classifier proposed in our recent
work [6, 7]. The aforementioned fusion methods used for the SVM classifiers
neglect the hidden patterns shared among the different features. To capture
the relationships of distinct features, we constructed a regularized DNN
for video classification. Specifically, as shown in Figure 2, in the
regularized DNN, a layer of neurons was first used to perform feature
abstraction separately for each input feature. After that, another layer
was used for feature fusion with carefully designed structural-norm
regularization on network weights, which can identify feature
relationships. Finally, the fused representation was used to build a
classification model in the last layer. With this special network, we are
able to fuse features by considering both feature correlation and feature
diversity, and to perform classification simultaneously. See [6, 7] for
more details.

1.3 Score Smoothing and Clip Merging
  Temporal score smoothing has proven effective, as incorrect predictions
on a short clip may be eliminated by considering predictions on nearby
clips. All the videos were first partitioned uniformly into 3-second-long
clips. The smoothed prediction score of a clip is simply the average of the
scores in a three-clip window.
  As we need to output segment-level predictions (not fixed-length
clip-level predictions), we merged consecutive clips if they were all
determined to contain violence or all determined to contain no violence,
that is, if their violence scores were all above or all below a threshold.
The new score of the merged segment was set to the average score of its
clips.

2. RESULTS AND DISCUSSIONS
  We submitted 5 runs for official evaluation. As shown in Figure 1, Run 1
and Run 2 used SVM and DNN respectively. Run 2 did not use the FV encoding
of the HOG, HOF and MBH features, as the dimensionality of these three
features is too high, which would jeopardize the performance of the DNN
when there is insufficient training data. Run 3 is the score fusion of
Run 1 and Run 2. Run 4 is the score-smoothed version of Run 3 (smoothing
was performed before merging), while Run 5 is the direct fusion of SVM and
DNN without using any smoothing and merging functions.
  The official results are summarized in Figure 3. We see that, although
some features were not used in DNN, the performance of DNN (Run 2) is still
significantly better than that of SVM. This clearly confirms the
effectiveness of deep networks. Directly fusing DNN and SVM incurs a small
performance drop (Run 3). This may be due to the sub-optimal parameters
used in the fusion process. Another fusion setting (Run 5) without using
score merging improves the main task performance but still hurts the result
of the generalization task. This suggests that DNN has better
generalization capability than SVM, and thus that fusing SVM with DNN tends
to degrade the performance on the generalization task. Finally, the results
of Run 4 indicate that both smoothing and merging are useful for the main
task. It is not surprising that smoothing does not work for the
generalization task because, compared with the long movies used in the main
task, the test clips are short and temporally more consistent.

Acknowledgements
This work was supported in part by a National 863 Program (#2014AA015101),
the National Natural Science Foundation of China (#61201387), and the
Science and Technology Commission of Shanghai Municipality (#13PJ1400400,
#13511504503, #12511501602).

3. REFERENCES
[1] M. Sjöberg, B. Ionescu, Y.-G. Jiang, V. L. Quang, M. Schedl, and
    C.-H. Demarty. The MediaEval 2014 Affect Task: Violent Scenes
    Detection. In MediaEval 2014 Workshop, Barcelona, Spain, Oct 16-17,
    2014.
[2] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based
    modeling of human actions with motion reference points. In ECCV, 2012.
[3] S. Roweis. EM Algorithms for PCA and SPCA. In NIPS, 1998.
[4] H. Wang and C. Schmid. Action Recognition With Improved Trajectories.
    In ICCV, 2013.
[5] I. Laptev. On space-time interest points. IJCV, 64:107–123, 2005.
[6] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring Inter-feature
    and Inter-class Relationships with Deep Neural Networks for Video
    Classification. In ACM MM, 2014.
[7] J. Tu, Z. Wu, Q. Dai, Y.-G. Jiang, and X. Xue. Challenge Huawei
    Challenge: Fusing Multimodal Features with Deep Neural Networks for
    Mobile Video Annotation. In ICME, 2014.
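
  The score smoothing and clip merging of Section 1.3 can be illustrated
with a minimal sketch. This is not the authors' code. The centered
three-clip window, the truncated window at video boundaries, the threshold
value of 0.5, the use of >= at the threshold, and all function names are
assumptions made for illustration. Only the 3-second clip length, the
three-clip averaging, and the threshold-based merging come from the paper.

```python
from typing import List, Tuple

CLIP_SECONDS = 3.0  # clip length stated in the paper

def smooth_scores(scores: List[float]) -> List[float]:
    """Average each clip score with its neighbours (three-clip window).

    Assumption: the window is centered and is truncated at the first
    and last clip of a video.
    """
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed

def merge_clips(scores: List[float],
                threshold: float = 0.5,  # hypothetical value
                clip_seconds: float = CLIP_SECONDS
                ) -> List[Tuple[float, float, float, bool]]:
    """Merge consecutive clips that fall on the same side of the threshold.

    Returns (start_sec, end_sec, mean_score, is_violent) per segment.
    """
    segments = []
    start = 0
    for i in range(1, len(scores) + 1):
        at_end = i == len(scores)
        if at_end or (scores[i] >= threshold) != (scores[start] >= threshold):
            run = scores[start:i]
            segments.append((start * clip_seconds,
                             i * clip_seconds,
                             sum(run) / len(run),
                             scores[start] >= threshold))
            start = i
    return segments

# Smoothing is applied before merging, as in Run 4.
clip_scores = [0.9, 0.8, 0.1, 0.2, 0.7]
segments = merge_clips(smooth_scores(clip_scores))
```

  In this sketch, an isolated high or low score is pulled toward its
neighbours before thresholding, which is the effect Section 1.3 relies on
to eliminate incorrect predictions on a single short clip.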