MIC-TJU at MediaEval Violent Scenes Detection (VSD) 2014

Bowen Zhang, Yun Yi, Hanli Wang, Jian Yu
Department of Computer Science and Technology, Tongji University, Shanghai 201804, P. R. China
{102310,13yiyun,hanliwang,yujian}@tongji.edu.cn

ABSTRACT
The Violent Scenes Detection task requires building a system that detects segments containing physical violence in both movies and videos found on the web. This is very challenging due to camera jitter in hand-shot videos and unconstrained shot boundaries in movies and web videos. In this paper, we present a system that combines shot boundary detection, feature extraction in both the audio and video domains, the Bag-of-Words model and Support Vector Machines. The key component of the system lies in trajectory-based features calculated around robust optical flows, which are extracted by a novel salient keypoint trajectory algorithm. According to our results, good detection performance can be achieved by using trajectory-based features combined with dense SIFT and MFCC.

Figure 1: Overview of the MIC-TJU system for VSD 2014.

1. INTRODUCTION
Violent Scenes Detection (VSD) is a challenging task which requires teams to build a high-performance system that automatically detects video segments containing violence. VSD 2014 contains two sub-tasks: the main task and the generalization task. A brief introduction to the training and testing datasets as well as the evaluation metrics of these two sub-tasks is given in [4]. In this paper, we discuss the techniques and algorithms employed by our system, as well as the system architecture and evaluation results.

2. SYSTEM DESCRIPTION
The architecture of the proposed system is shown in Fig. 1. We adopt the Bag-of-Words (BoW) framework with a Gaussian Mixture Model (GMM), Fisher Vectors (FV) and Support Vector Machines (SVM). A threshold-based video shot boundary detector is first used to detect video shot boundaries [6]. After that, we extract features from audio and video. FVs are then used to encode the video and audio features into a single high-dimensional vector using a codebook generated by a GMM. Since fusion is observed to have a great influence on the final results, different fusion methods are used to fuse the vectors from different features. Because SVM with a linear kernel shows good performance with FVs, it is employed as the classifier of our system [1][5].

2.1 Shot Boundary Detection
In VSD 2014, no video shot boundaries are provided, neither for movies nor for web videos, which causes difficulties for feature extraction and encoding. To address this issue, we employ the shot boundary detection method presented in [6], which adopts the difference of histograms with an adaptive threshold. Specifically, the difference of histograms between two adjacent frames is first computed. We use a range of 15 frames ahead of the current frame to compute the standard deviation (STD) and mean of these differences. If the STD is lower than a specific value, namely Tvb, there are few fluctuations within these 15 frames, so they can be used to adapt the video shot boundary thresholds. In this work, Tvb is set to 500,000, which empirically gives good results. To enhance the robustness of shot boundary detection, we use a method based on two thresholds to detect both hard cuts and gradual changes: the lower threshold detects gradual changes and the higher one detects hard cuts. Both adaptive thresholds are computed from the aforementioned mean of previous differences of histograms. A hard cut is detected if the difference of histograms between the current frame and the previous frame exceeds the corresponding threshold for hard cut detection.
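As a minimal sketch of this two-threshold scheme, the following Python fragment adapts both thresholds from the running mean of histogram differences whenever the 15-frame window is stable. The window length and Tvb follow the text above; the multipliers k_grad and k_cut that scale the mean into the two thresholds are illustrative assumptions, since the paper does not state their values.

import numpy as np

def detect_shot_boundaries(hist_diffs, t_vb=500_000, window=15,
                           k_grad=3.0, k_cut=6.0):
    # hist_diffs[i] is the histogram difference between frame i and frame i-1.
    # t_vb and the 15-frame window follow the paper; k_grad and k_cut, which
    # scale the running mean into the two thresholds, are assumptions.
    cuts, graduals = [], []
    th_cut = th_grad = float("inf")             # undefined before the first stable window
    for i in range(window, len(hist_diffs)):
        prev = hist_diffs[i - window:i]         # the 15 frames ahead of frame i
        if np.std(prev) < t_vb:                 # few fluctuations: adapt the thresholds
            mean = np.mean(prev)
            th_cut, th_grad = k_cut * mean, k_grad * mean
        if hist_diffs[i] > th_cut:              # higher threshold -> hard cut
            cuts.append(i)
        elif hist_diffs[i] > th_grad:           # lower threshold -> gradual change
            graduals.append(i)
    return cuts, graduals

Given a list of per-frame histogram differences, detect_shot_boundaries returns candidate hard-cut and gradual-change frame indices; shots are then the intervals between consecutive boundaries.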
2.2 Feature Extraction
For feature extraction, two different kinds of video features are used: trajectory-based features and one appearance feature.

2.2.1 Video Features
Firstly, salient keypoint trajectories are generated to track human actions at multiple spatial scales [5]. Then, camera motion elimination [5] is utilized to further improve the robustness of the trajectories. To encode human motions accurately and efficiently, the Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH) are employed, with the FV model being utilized to aggregate these three features [5]. The dimensions of the three descriptors are 96 for HOG, 108 for HOF and 192 for MBH, respectively. Regarding the appearance feature, we use densely extracted SIFT features, computing SIFT descriptors every 60 video frames at multiple scales on a dense grid (i.e., 21×21 patches with 4-pixel steps and 5 scales) [3].

After extraction, the descriptors are normalized with the signed square root, and PCA is applied individually to each of the three trajectory-based descriptors (HOG, HOF and MBH) to reduce them to half of their original dimension. Then, a codebook is constructed for each descriptor and FVs are computed over it. We compute one FV over the complete video and apply signed square root normalization, which significantly improves recognition performance in combination with a linear SVM. (A sketch of this encoding pipeline is given after Table 1.)

As far as classification is concerned, a linear SVM is employed in this work, and early fusion is performed to generate the final feature vector by concatenating the aforementioned four feature vectors (HOG, HOF, MBH and dense SIFT) into a single one. In our implementation, the standard linear LIBSVM is used with the penalty parameter C equal to 100, which has been shown to exhibit good performance.

2.2.2 Audio Features
Since segments containing violent scenes carry auditory clues, audio features should also be considered. We therefore adopt the popular Mel-Frequency Cepstral Coefficients (MFCC) algorithm [2]. The time window for each MFCC is 32 ms, with 50% overlap between two adjacent windows. To fully utilize the discriminative ability of MFCC, we integrate the delta and double-delta of the MFCC vector into the original MFCC vector to generate a 60-dimensional MFCC vector. To represent a whole audio file as a single vector, we adopt the classic BoW framework with FV and GMM. Linear LIBSVM is used as the classifier for the audio features, with the penalty parameter C equal to 100. (The MFCC extraction is likewise sketched after Table 1.)

2.3 Experimental Setup
The configurations of our five submitted runs are summarized in Table 1. For late fusion, a weighted arithmetic sum of the scores output by the SVMs for the video features (trajectory-based features and the appearance feature) and the audio feature is calculated. For double fusion, we first perform early fusion of the video features and then late fusion of the video and audio features. The weight settings separated by colons in Table 1 are the weights applied to the different kinds of features during late fusion. (Both fusion schemes are sketched below.)

Table 1: Configuration of runs of MIC-TJU.
Run  Trajectory-based features  Appearance feature  Audio feature  Fusion         Weights
1    HOG, HOF, MBH              -                   MFCC           Late fusion    4:1
2    HOG, HOF, MBH              Dense SIFT          MFCC           Double fusion  4:1
3    HOG, HOF, MBH              Dense SIFT          MFCC           Double fusion  1:1
4    HOG, HOF, MBH              Dense SIFT          MFCC           Late fusion    4:1:1
5    HOG, HOF, MBH              Dense SIFT          MFCC           Late fusion    1:1:1
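To make the encoding pipeline of Section 2.2 concrete, the following Python sketch applies signed square root normalization, PCA to half the dimension, a GMM codebook and FV encoding to one descriptor type. The codebook size of 256 components and the use of scikit-learn are illustrative assumptions; for brevity the PCA and GMM are fit on the same descriptors, whereas in practice they would be trained on a separate descriptor sample.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    # Standard FV gradients w.r.t. the GMM means and variances (2*K*d dims).
    q = gmm.predict_proba(x)                          # (N, K) posteriors
    n = x.shape[0]
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    d = (x[:, None, :] - mu) / np.sqrt(var)           # whitened residuals, (N, K, d)
    g_mu = (q[..., None] * d).sum(axis=0) / (n * np.sqrt(w)[:, None])
    g_var = (q[..., None] * (d ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    return np.sign(fv) * np.sqrt(np.abs(fv))          # signed square root normalization

def encode(descriptors, n_components=256):
    # One encoder per descriptor type; the paper reduces HOG, HOF and MBH
    # to half their original dimension before the GMM codebook is built.
    x = np.sign(descriptors) * np.sqrt(np.abs(descriptors))
    pca = PCA(n_components=x.shape[1] // 2).fit(x)
    gmm = GaussianMixture(n_components, covariance_type="diag").fit(pca.transform(x))
    return fisher_vector(pca.transform(x), gmm)       # one FV per video

Early fusion then amounts to concatenating the per-descriptor FVs, e.g. np.hstack([encode(hog), encode(hof), encode(mbh), encode(sift)]), before training the linear SVM.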
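The audio descriptors of Section 2.2.2 can be reproduced along the following lines. The librosa library and the wav_path argument are used purely for illustration, as the paper does not name its MFCC implementation; 20 base coefficients follow from the stated 60-dimensional result (20 static + 20 delta + 20 double-delta).

import librosa
import numpy as np

def mfcc_descriptors(wav_path):
    # 32 ms windows with 50% overlap follow the paper.
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(0.032 * sr)                           # 32 ms analysis window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=n_fft, hop_length=n_fft // 2)
    delta = librosa.feature.delta(mfcc)               # first temporal derivative
    delta2 = librosa.feature.delta(mfcc, order=2)     # second temporal derivative
    return np.vstack([mfcc, delta, delta2]).T         # (frames, 60)

The resulting per-window vectors are then aggregated with the same GMM/FV pipeline as above and classified with a linear SVM (C = 100).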
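Finally, the two fusion schemes of Section 2.3 reduce to a few lines. The per-shot SVM decision scores s_traj, s_sift, s_mfcc and s_video below are hypothetical arrays, and whether the weighted sum is normalized by the weight total is not specified in the paper; it is left unnormalized here, since a positive scale does not change the rank-based MAP measure.

import numpy as np

def late_fusion(score_lists, weights):
    # Weighted arithmetic sum of per-shot SVM scores, cf. the weights in Table 1.
    return sum(w * np.asarray(s) for w, s in zip(weights, score_lists))

# Run 4 (late fusion, 4:1:1): trajectory-based, dense SIFT and MFCC scores.
#   fused = late_fusion([s_traj, s_sift, s_mfcc], [4, 1, 1])
# Run 2 (double fusion, 4:1): the trajectory and dense SIFT FVs are first
# concatenated (early fusion) and scored by a single video SVM; its scores
# s_video are then late-fused with the audio scores:
#   fused = late_fusion([s_video, s_mfcc], [4, 1])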
3. RESULTS AND DISCUSSIONS
We submitted five runs, with the results given in Table 2 using the MAP2014 measure. The comparison of run 1 and run 4 shows that the dense SIFT feature helps improve recognition performance in the generalization task. However, there is a performance drop in the main task; the reason is that the late fusion strategy and weight assignment are suboptimal for dense SIFT in the main task. By comparing run 2 vs. run 3 as well as run 4 vs. run 5, we conclude that the weight assignment affects recognition performance and that the optimal weight setting differs across datasets. In general, we obtain better results in the generalization task than in the main task. One reason is that the video shots in the generalization task do not change as frequently as those in the main task, which improves the performance of the trajectory-based features. This also indicates that the main task is more challenging than the generalization task.

Table 2: Results of MIC-TJU on MAP2014.
Run  Main Task  Generalization Task
1    44.17%     56.01%
2    43.07%     56.52%
3    44.60%     55.56%
4    39.23%     56.62%
5    38.50%     56.00%

4. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, Apr. 2011.
[2] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Processing, 28(4):357–366, Aug. 1980.
[3] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV'13, pages 1817–1824, 2013.
[4] M. Sjöberg, B. Ionescu, Y.-G. Jiang, V. L. Quang, M. Schedl, and C.-H. Demarty. The MediaEval 2014 affect task: Violent scenes detection. In MediaEval 2014 Workshop, 2014.
[5] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV'13, pages 3551–3558, 2013.
[6] D. Zhang, W. Qi, and H. J. Zhang. A new shot boundary detection algorithm. In PCM'01, pages 63–70, 2001.