MIC-TJU at MediaEval Violent Scenes Detection (VSD) 2014

Bowen Zhang, Yun Yi, Hanli Wang, Jian Yu
Department of Computer Science and Technology, Tongji University, Shanghai 201804, P. R. China
{102310,13yiyun,hanliwang,yujian}@tongji.edu.cn

ABSTRACT
The Violent Scenes Detection task requires building a system that detects segments containing physical violence in both movies and videos found on the web. This is very challenging due to camera jitter in hand-shot videos and unconstrained shot boundaries in movies and web videos. In this paper, we present a system that combines shot boundary detection, feature extraction in both the audio and video domains, the Bag-of-Words model and Support Vector Machines. The key component of the system lies in trajectory-based features calculated around robust optical flows, which are extracted by a novel salient keypoint trajectory algorithm. According to our results, good detection performance can be achieved by using trajectory-based features combined with dense SIFT and MFCC.

Figure 1: Overview of the MIC-TJU system for VSD 2014.

1. INTRODUCTION
Violent Scenes Detection (VSD) is a challenging task which requires teams to build a high-performance system that automatically detects video segments containing violence. VSD 2014 contains two sub-tasks: the main task and the generalization task. A brief introduction to the training and testing datasets as well as the evaluation metrics of these two sub-tasks is given in [4]. In this paper, we discuss the techniques and algorithms employed by our system, as well as the system architecture and evaluation results.

2. SYSTEM DESCRIPTION
The architecture of the proposed system is shown in Fig. 1. We adopt the Bag-of-Words (BoW) framework with a Gaussian Mixture Model (GMM), Fisher Vectors (FV) and Support Vector Machines (SVM). A threshold-based video shot boundary detector is first used to detect video shot boundaries [6]. After that, we extract features from audio and video. FVs are then used to encode the video and audio features into a single high-dimensional vector using a codebook generated by a GMM. Since fusion is observed to have a great influence on the final results, different fusion methods are used to fuse the vectors from different features. Because SVM with a linear kernel shows good performance with FVs, it is employed as the classifier of our system [1][5].

2.1 Shot Boundary Detection
In VSD 2014, no video shot boundaries are provided, neither for movies nor for web videos, which causes difficulties for feature extraction and encoding. To address this issue, we employ the shot boundary detection method presented in [6], which adopts the difference of histograms with an adaptive threshold. Specifically, the difference of histograms between two adjacent frames is first computed. We use a range of 15 frames ahead of the current frame to compute the standard deviation (STD) and mean of these differences. If the STD is lower than a specific value, namely Tvb, there are few fluctuations within these 15 frames, so they can be used to adapt the video shot boundary thresholds. In this work, Tvb is set to 500,000, which empirically gives good results. To enhance the robustness of shot boundary detection, we use a method based on two thresholds to detect both hard cuts and gradual changes: the lower threshold detects gradual changes and the higher one detects hard cuts. Both adaptive thresholds are computed from the aforementioned mean of previous differences of histograms. A hard cut is detected if the difference of histograms between the current frame and the previous frame exceeds the corresponding threshold for hard cut detection.
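As a minimal sketch of this two-threshold scheme, the following Python fragment adapts both thresholds from the running mean of histogram differences whenever the 15-frame window is stable. The window length and Tvb follow the text above; the multipliers k_grad and k_cut that scale the mean into the two thresholds are illustrative assumptions, since the paper does not state their values.

import numpy as np

def detect_shot_boundaries(hist_diffs, t_vb=500_000, window=15,
                           k_grad=3.0, k_cut=6.0):
    # hist_diffs[i] is the histogram difference between frame i and frame i-1.
    # t_vb and the 15-frame window follow the paper; k_grad and k_cut, which
    # scale the running mean into the two thresholds, are assumptions.
    cuts, graduals = [], []
    th_cut = th_grad = float("inf")             # undefined before the first stable window
    for i in range(window, len(hist_diffs)):
        prev = hist_diffs[i - window:i]         # the 15 frames ahead of frame i
        if np.std(prev) < t_vb:                 # few fluctuations: adapt the thresholds
            mean = np.mean(prev)
            th_cut, th_grad = k_cut * mean, k_grad * mean
        if hist_diffs[i] > th_cut:              # higher threshold -> hard cut
            cuts.append(i)
        elif hist_diffs[i] > th_grad:           # lower threshold -> gradual change
            graduals.append(i)
    return cuts, graduals

Given a list of per-frame histogram differences, detect_shot_boundaries returns candidate hard-cut and gradual-change frame indices; shots are then the intervals between consecutive boundaries.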
2.2 Feature Extraction
For feature extraction, two different kinds of video features are used: trajectory-based features and one appearance feature.

2.2.1 Video Features
Firstly, salient keypoint trajectories are generated to track human actions at multiple spatial scales [5]. Then, camera motion elimination [5] is utilized to further improve the robustness of the trajectories. To encode human motions accurately and efficiently, the Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH) are employed, with the FV model being utilized to aggregate these three features [5]. The dimensions of the three descriptors are 96 for HOG, 108 for HOF and 192 for MBH, respectively. Regarding the appearance feature, we use densely extracted SIFT features, computing SIFT descriptors every 60 video frames at multiple scales on a dense grid (i.e., 21×21 patches with 4-pixel steps and 5 scales) [3].

After extraction, the descriptors are normalized with the signed square root, and PCA is applied individually to each of the three trajectory-based descriptors (HOG, HOF and MBH) to reduce them to half of their original dimension. Then, a codebook is constructed for each descriptor and FVs are computed over it. We compute one FV over the complete video and apply signed square root normalization, which significantly improves recognition performance in combination with a linear SVM. (A sketch of this encoding pipeline is given after Table 1.)

As far as classification is concerned, a linear SVM is employed in this work, and early fusion is performed to generate the final feature vector by concatenating the aforementioned four feature vectors (HOG, HOF, MBH and dense SIFT) into a single one. In our implementation, the standard linear LIBSVM is used with the penalty parameter C equal to 100, which has been shown to exhibit good performance.

2.2.2 Audio Features
Since segments containing violent scenes carry auditory clues, audio features should also be considered. We therefore adopt the popular Mel-Frequency Cepstral Coefficients (MFCC) algorithm [2]. The time window for each MFCC is 32 ms, with 50% overlap between two adjacent windows. To fully utilize the discriminative ability of MFCC, we integrate the delta and double-delta of the MFCC vector into the original MFCC vector to generate a 60-dimensional MFCC vector. To represent a whole audio file as a single vector, we adopt the classic BoW framework with FV and GMM. Linear LIBSVM is used as the classifier for the audio features, with the penalty parameter C equal to 100. (The MFCC extraction is likewise sketched after Table 1.)

2.3 Experimental Setup
The configurations of our five submitted runs are summarized in Table 1. For late fusion, a weighted arithmetic sum of the scores output by the SVMs for the video features (trajectory-based features and the appearance feature) and the audio feature is calculated. For double fusion, we first perform early fusion of the video features and then late fusion of the video and audio features. The weight settings separated by colons in Table 1 are the weights applied to the different kinds of features during late fusion. (Both fusion schemes are sketched below.)

Table 1: Configuration of runs of MIC-TJU.
Run  Trajectory-based features  Appearance feature  Audio feature  Fusion         Weights
1    HOG, HOF, MBH              -                   MFCC           Late fusion    4:1
2    HOG, HOF, MBH              Dense SIFT          MFCC           Double fusion  4:1
3    HOG, HOF, MBH              Dense SIFT          MFCC           Double fusion  1:1
4    HOG, HOF, MBH              Dense SIFT          MFCC           Late fusion    4:1:1
5    HOG, HOF, MBH              Dense SIFT          MFCC           Late fusion    1:1:1
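To make the encoding pipeline of Section 2.2 concrete, the following Python sketch applies signed square root normalization, PCA to half the dimension, a GMM codebook and FV encoding to one descriptor type. The codebook size of 256 components and the use of scikit-learn are illustrative assumptions; for brevity the PCA and GMM are fit on the same descriptors, whereas in practice they would be trained on a separate descriptor sample.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    # Standard FV gradients w.r.t. the GMM means and variances (2*K*d dims).
    q = gmm.predict_proba(x)                          # (N, K) posteriors
    n = x.shape[0]
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    d = (x[:, None, :] - mu) / np.sqrt(var)           # whitened residuals, (N, K, d)
    g_mu = (q[..., None] * d).sum(axis=0) / (n * np.sqrt(w)[:, None])
    g_var = (q[..., None] * (d ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    return np.sign(fv) * np.sqrt(np.abs(fv))          # signed square root normalization

def encode(descriptors, n_components=256):
    # One encoder per descriptor type; the paper reduces HOG, HOF and MBH
    # to half their original dimension before the GMM codebook is built.
    x = np.sign(descriptors) * np.sqrt(np.abs(descriptors))
    pca = PCA(n_components=x.shape[1] // 2).fit(x)
    gmm = GaussianMixture(n_components, covariance_type="diag").fit(pca.transform(x))
    return fisher_vector(pca.transform(x), gmm)       # one FV per video

Early fusion then amounts to concatenating the per-descriptor FVs, e.g. np.hstack([encode(hog), encode(hof), encode(mbh), encode(sift)]), before training the linear SVM.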
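The audio descriptors of Section 2.2.2 can be reproduced along the following lines. The librosa library and the wav_path argument are used purely for illustration, as the paper does not name its MFCC implementation; 20 base coefficients follow from the stated 60-dimensional result (20 static + 20 delta + 20 double-delta).

import librosa
import numpy as np

def mfcc_descriptors(wav_path):
    # 32 ms windows with 50% overlap follow the paper.
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(0.032 * sr)                           # 32 ms analysis window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=n_fft, hop_length=n_fft // 2)
    delta = librosa.feature.delta(mfcc)               # first temporal derivative
    delta2 = librosa.feature.delta(mfcc, order=2)     # second temporal derivative
    return np.vstack([mfcc, delta, delta2]).T         # (frames, 60)

The resulting per-window vectors are then aggregated with the same GMM/FV pipeline as above and classified with a linear SVM (C = 100).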
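Finally, the two fusion schemes of Section 2.3 reduce to a few lines. The per-shot SVM decision scores s_traj, s_sift, s_mfcc and s_video below are hypothetical arrays, and whether the weighted sum is normalized by the weight total is not specified in the paper; it is left unnormalized here, since a positive scale does not change the rank-based MAP measure.

import numpy as np

def late_fusion(score_lists, weights):
    # Weighted arithmetic sum of per-shot SVM scores, cf. the weights in Table 1.
    return sum(w * np.asarray(s) for w, s in zip(weights, score_lists))

# Run 4 (late fusion, 4:1:1): trajectory-based, dense SIFT and MFCC scores.
#   fused = late_fusion([s_traj, s_sift, s_mfcc], [4, 1, 1])
# Run 2 (double fusion, 4:1): the trajectory and dense SIFT FVs are first
# concatenated (early fusion) and scored by a single video SVM; its scores
# s_video are then late-fused with the audio scores:
#   fused = late_fusion([s_video, s_mfcc], [4, 1])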
3. RESULTS AND DISCUSSIONS
We submitted five runs, with the results given in Table 2 using the MAP2014 measure. The comparison of run 1 and run 4 shows that the dense SIFT feature helps improve recognition performance in the generalization task. However, there is a performance drop in the main task; the reason is that the late fusion strategy and weight assignment are suboptimal for dense SIFT in the main task. By comparing run 2 vs. run 3 as well as run 4 vs. run 5, we conclude that the weight assignment affects recognition performance and that the optimal weight setting differs across datasets. In general, we obtain better results in the generalization task than in the main task. One reason is that the video shots in the generalization task do not change as frequently as those in the main task, which improves the performance of the trajectory-based features. This also indicates that the main task is more challenging than the generalization task.

Table 2: Results of MIC-TJU on MAP2014.
Run  Main Task  Generalization Task
1    44.17%     56.01%
2    43.07%     56.52%
3    44.60%     55.56%
4    39.23%     56.62%
5    38.50%     56.00%

4. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, Apr. 2011.
[2] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Processing, 28(4):357–366, Aug. 1980.
[3] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV'13, pages 1817–1824, 2013.
[4] M. Sjöberg, B. Ionescu, Y.-G. Jiang, V. L. Quang, M. Schedl, and C.-H. Demarty. The MediaEval 2014 affect task: Violent scenes detection. In MediaEval 2014 Workshop, 2014.
[5] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV'13, pages 3551–3558, 2013.
[6] D. Zhang, W. Qi, and H. J. Zhang. A new shot boundary detection algorithm. In PCM'01, pages 63–70, 2001.