NII-UIT at MediaEval 2013 Violent Scenes Detection Affect Task

Vu Lam (University of Science, 227 Nguyen Van Cu, Dist. 5, Ho Chi Minh City, Vietnam) lqvu@fit.hcmus.edu.vn
Duy-Dinh Le (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan) ledduy@nii.ac.jp
Sang Phan (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan) plsang@nii.ac.jp
Shin'ichi Satoh (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan) satoh@nii.ac.jp
Duc Anh Duong (University of Information Technology, KM20 Ha Noi Highway, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam) ducda@uit.edu.vn

ABSTRACT
We present a comprehensive evaluation of shot-based visual and audio features for the MediaEval 2013 Violent Scenes Detection Affect Task. For visual features we use global features, local SIFT features, and motion features; for audio we employ the popular MFCC. We also evaluate the performance of mid-level features constructed from visual concepts. These features are combined by late fusion, and the results obtained by our submitted runs are presented.

Keywords
semantic concept detection, global feature, local feature, motion feature, audio feature, mid-level feature, late fusion

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

1. INTRODUCTION
We have developed NII-KAORI-SECODE, a general framework for semantic concept detection, and have used it to participate in several benchmarks such as IMAGECLEF, MEDIAEVAL, PASCAL-VOC, IMAGE-NET and TRECVID. This year we evaluate performance on a concept-detection-like task using shot-based feature representations only. Our previous work shows that shot-based features not only reduce the computational cost but also improve performance.

We treat the Violent Scenes Detection (VSD) Task [1] as a concept detection task and use the NII-KAORI-SECODE framework for evaluation. First, keyframes are extracted by sampling 5 keyframes per second. Raw features are extracted for all keyframes in each shot, and shot-based features are then formed from the keyframe-based features by average or max pooling. The motion feature and the audio feature are extracted directly from the whole shot. For the mid-level feature, we first build attribute classifiers for 7 visual attributes: fights, blood, gore, fire, car chase, cold arms, and firearms; the output scores of these attribute classifiers are then concatenated to form the mid-level feature representation. For all features, we use the popular SVM algorithm for learning. Finally, the probability output scores of the learned classifiers are used to rank the retrieved shots.

We use the same framework for both the objective and the subjective task (only the annotations differ). Our results show that the combined run using all visual, audio, and mid-level features achieves the best performance.

2. LOW-LEVEL FEATURES
We use features from different modalities to test whether they are complementary for violent scenes detection. Our current VSD system incorporates still-image features, a motion feature, and an audio feature.

2.1 Still Image Features
We use both global and local features for VSD because they capture different characteristics of images. For global features, we use Color Histogram (CH), Color Moments (CM), Edge Orientation Histogram (EOH), and Local Binary Patterns (LBP). For local features, we use the popular SIFT descriptor with both Hessian-Laplace interest points and dense sampling at multiple scales. For dense sampling, besides the standard SIFT descriptor, we also use Opponent-SIFT and C-SIFT; for the interest-point detector, we use only the standard SIFT descriptor. We employ the bag-of-words model with a codebook of size 1,000 and soft assignment to generate a fixed-dimension representation for each keyframe. Besides encoding the whole image, we also divide it into 3x1 and 2x2 grids to encode spatial information. Finally, to obtain a single representation per shot, we apply two pooling strategies: average pooling and max pooling.

2.2 Motion Feature
Trajectories are obtained by tracking densely sampled points in the optical flow fields, and each trajectory is described with the Motion Boundary Histogram (MBH), a descriptor known to perform well under camera motion. The motion feature is encoded with Fisher vectors after dimensionality reduction by PCA. The codebook has 256 components and is trained with a Gaussian Mixture Model (GMM); the final feature dimension is 65,536.

2.3 Audio Feature
We use the popular MFCC to extract the audio feature, with a segment length of 25 ms and a step size of 10 ms. Each audio segment is represented by 13-dimensional MFCCs together with their first and second derivatives. The raw MFCC features are also encoded with Fisher vectors, using a GMM codebook with 256 clusters and PCA for dimensionality reduction, resulting in 12,288-dimensional feature descriptors.
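To make the audio front end concrete, the sketch below extracts 25 ms / 10 ms MFCCs with first- and second-order derivatives, as described above. The paper does not name a toolkit; librosa and the helper shot_mfcc_descriptors are our own illustrative choices, not part of the original system.

```python
# Sketch of the MFCC front end in Section 2.3: 25 ms analysis windows,
# 10 ms step size, 13 MFCCs plus first and second derivatives.
# librosa is an assumed tool choice.
import librosa
import numpy as np

def shot_mfcc_descriptors(wav_path):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    n_fft = int(0.025 * sr)        # 25 ms window
    hop = int(0.010 * sr)          # 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)   # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)   # second derivative
    # One 39-d descriptor per 10 ms frame, shape (n_frames, 39)
    return np.vstack([mfcc, d1, d2]).T
```

The resulting 39-dimensional frame descriptors are then aggregated over each shot by the Fisher vector encoding sketched next.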
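Both the MBH and MFCC descriptors are PCA-reduced and Fisher-vector encoded with a 256-component GMM codebook. Below is a minimal sketch of such an encoding (first- and second-order statistics of a diagonal-covariance GMM); the power and L2 normalisation steps and the scikit-learn implementation are assumptions on our part, not details given in the paper.

```python
# Minimal Fisher vector encoding sketch for Sections 2.2/2.3:
# PCA reduction, a 256-component diagonal GMM codebook, then
# first/second-order statistics per component. Normalisation is assumed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

K = 256  # codebook size stated in the paper

def fit_codebook(train_descriptors, pca_dim):
    pca = PCA(n_components=pca_dim).fit(train_descriptors)
    gmm = GaussianMixture(n_components=K, covariance_type='diag',
                          random_state=0).fit(pca.transform(train_descriptors))
    return pca, gmm

def fisher_vector(descriptors, pca, gmm):
    x = pca.transform(descriptors)                    # (N, D)
    n = x.shape[0]
    q = gmm.predict_proba(x)                          # posteriors, (N, K)
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])     # 2 * K * D dimensions
    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)          # L2 normalisation
```

Given the reported output sizes (65,536 = 2 x 256 x 128 for MBH and 12,288 = 2 x 256 x 24 for MFCC), the PCA dimensions appear to be 128 and 24 respectively, although the paper does not state them explicitly.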
3. MID-LEVEL FEATURE
Besides low-level features, we also investigate using related violence information as a mid-level feature for detecting violent scenes. We use seven violence-related concepts as attributes: fire, firearms, cold arms, car chase, gore, blood, and fight. Low-level image features are used to train the attribute classifiers on the 2011 VSD development set. For each image, we apply these attribute classifiers to obtain a score for each attribute, and we concatenate these scores to form the mid-level representation of the image. We then train our mid-level classifier on the 2012 VSD development and test sets, and this classifier is used for testing on this year's set. The detailed workflow is shown in Figure 1.

[Figure 1: Mid-level feature construction]

4. CLASSIFICATION
LibSVM is used for training and testing at the shot level (based on the shot boundaries provided by the organizers). To generate training data, shots that overlap annotated positive segments by more than 80% are treated as positive; the remaining shots are treated as negative. Extracted features are scaled to [0, 1] using the svm-scale tool of LibSVM. For still-image features, we use a chi-square kernel to compute the kernel matrix; for the audio and motion features, which are Fisher-vector encoded, a linear kernel is used. The optimal gamma and cost parameters of the SVM classifiers are found by grid search with 5-fold cross-validation on the training set.
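As an illustration of this procedure, the sketch below trains a chi-square-kernel SVM for one still-image feature with a 5-fold grid search over gamma and cost. scikit-learn's chi2_kernel and SVC stand in for the LibSVM command-line tools, the parameter grid is illustrative, and average precision is used as the model-selection score; none of these specifics are stated in the paper.

```python
# Sketch of the Section 4 training loop for one still-image feature:
# [0,1]-scaled shot features, an exponential chi-square kernel, and a
# 5-fold grid search over gamma and C. scikit-learn replaces LibSVM here.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score

def grid_search_chi2_svm(X, y, gammas=(0.1, 1.0, 10.0), costs=(1.0, 10.0, 100.0)):
    """X: shot-level features scaled to [0, 1]; y: 0/1 violence labels."""
    best = (None, None, -1.0)
    for gamma in gammas:
        K = chi2_kernel(X, gamma=gamma)      # exp(-gamma * chi-square distance)
        for C in costs:
            fold_scores = []
            for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
                clf = SVC(C=C, kernel='precomputed', probability=True)
                clf.fit(K[np.ix_(tr, tr)], y[tr])
                p = clf.predict_proba(K[np.ix_(va, tr)])[:, 1]
                fold_scores.append(average_precision_score(y[va], p))
            if np.mean(fold_scores) > best[2]:
                best = (gamma, C, float(np.mean(fold_scores)))
    return best  # (gamma, C, mean cross-validated average precision)
```

The probability outputs of the selected classifier are then used to rank the test shots, as described in the introduction.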
5. SUBMITTED RUNS
We employ a simple late fusion strategy over the low-level and mid-level features described above, giving equal weights to the individual scores. We submitted five runs in total: (R5) fusion of all 4 global features and 5 local features; (R4) fusion of the motion feature (dense trajectories + MBH) and the audio feature (MFCC); (R3) the run using the mid-level feature; (R2) fusion of R4 and R5; and (R1) fusion of R3, R4 and R5.

6. RESULTS AND DISCUSSIONS
The detailed performance of each submitted run is shown in Figure 2. We report results for both the objective and the subjective task, using two evaluation metrics per task: overall MAP and MAP100 (the MAP at the top 100 returned shots). Our best run is R1, which fuses the global, local, motion, audio, and mid-level features; this confirms the benefit of combining multiple features for violent scenes detection. Among all submitted runs, the mid-level run (R3) performs worst; however, it is complementary when combined with the low-level features in R1. The combined motion and audio run (R4) did not achieve the expected results; in fact, its performance is lower than that of the combined still-image run (R5). This may be due to limited motion within individual shots and/or noise in the audio signals.

[Figure 2: Results of our submitted runs]

Our future work includes investigating the contributions of the motion and audio features further. The mid-level feature result is also promising: we currently use only 7 visual concepts to construct the mid-level feature, and in the future we will also incorporate audio concepts built on the audio feature.

7. ACKNOWLEDGEMENTS
This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

8. REFERENCES
[1] C.-H. Demarty, C. Penet, M. Schedl, B. Ionescu, Q. V. Lam, and Y.-G. Jiang. The MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.