=Paper=
{{Paper
|id=Vol-1263/paper57
|storemode=property
|title=NII-UIT at MediaEval 2014 Violent Scenes Detection Affect Task
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_57.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LamLPSD14
}}
==NII-UIT at MediaEval 2014 Violent Scenes Detection Affect Task==
Vu Lam, University of Science, 227 Nguyen Van Cu, Dist. 5, Ho Chi Minh, Vietnam, lqvu@fit.hcmus.edu.vn
Duy-Dinh Le, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan, ledduy@nii.ac.jp
Sang Phan, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan, plsang@nii.ac.jp
Shin'ichi Satoh, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan, satoh@nii.ac.jp
Duc Anh Duong, University of Information Technology, KM20 Ha Noi Highway, Linh Trung Ward, Thu Duc District, Ho Chi Minh, Vietnam, ducda@uit.edu.vn

ABSTRACT

Violent scene detection (VSD) is a challenging problem because of heterogeneous content, large variations in video quality, and the semantic meaning of the concepts. The Violent Scenes Detection Task of MediaEval [1] provides a common dataset and evaluation protocol, thus enabling a fair comparison of methods. In this paper, we describe the VSD system we used in MediaEval 2014 and briefly discuss the performance results obtained in the main subjective task. This year, we focus on improving the trajectory-based motion features that proved effective in the previous year's evaluation. Besides that, we also adopt SIFT-based and audio features as in last year's system. We combine these features using late fusion. Our results show that the trajectory-based motion features still deliver very competitive performance, and that combining them with still image and audio features can improve overall performance.

1. INTRODUCTION

We consider the Violent Scenes Detection (VSD) task [1] as a concept detection task. For evaluation, we use our NII-KAORI-SECODE framework, which has achieved good performance on other benchmarks such as ImageCLEF and PASCAL VOC. First, videos are divided into equal segments of 5-second length. In each segment, keyframes are extracted by sampling 5 keyframes per second. For still image features, local descriptors are extracted and encoded for all keyframes in each segment, and segment-based features are then formed from the keyframe-based features by average or max pooling. Motion and audio features are extracted directly from the whole segment. For all features, we use the popular SVM algorithm for learning. Finally, the probability output scores of the learned classifiers are used to rank the retrieved segments.

2. FEATURE EXTRACTION

We use features from different modalities to test whether they are complementary for violent scenes detection. Currently, our VSD system incorporates still image features, motion features, and audio features.

2.1 Still Image Features

This year, we use only SIFT-based features for VSD because they can capture different characteristics of images. We extract popular SIFT-based features with both Hessian-Laplace interest points and dense sampling at multiple scales. Besides the standard SIFT descriptor, we also use Opponent-SIFT and Color-SIFT [2]. We employ the bag-of-words model with a codebook size of 1,000 and the soft-assignment technique to generate a fixed-dimension feature representation for each keyframe. Besides encoding the whole image, we also divide it into 3x1 and 2x2 grids to encode spatial information. Finally, to generate a single representation for each segment, we use two pooling strategies: average pooling and max pooling.

2.2 Motion Feature

We use Improved Trajectories [3] to extract dense trajectories. A combination of Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histogram (MBH) descriptors is used to describe each trajectory. We encode the HOGHOF and MBH features separately using Fisher vector encoding, with a codebook of 256 clusters trained using a Gaussian Mixture Model (GMM). After applying PCA, the Fisher vector representation of each descriptor has 65,536 dimensions. Finally, the two representations are concatenated to form the final feature vector with 131,072 dimensions.

2.3 Audio Feature

We use the popular Mel-frequency cepstral coefficients (MFCC) as audio features. We choose a window length of 25 ms and a step size of 10 ms. The 13-dimensional MFCC vectors, along with their first and second derivatives, are used to represent each audio segment. The raw MFCC features are also encoded using Fisher vector encoding, with a GMM codebook of 256 clusters. For audio features, we do not use PCA; the final feature descriptor has 19,968 dimensions. Our motion and audio pipelines are shown in Figure 2.

Figure 2: Our framework for extracting and encoding motion and audio features.
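As a concrete illustration of the segment-level encoding described above, the following minimal sketch (Python; the function name, variable names, and the 8-region layout are illustrative assumptions, not taken from the paper) pools per-keyframe bag-of-words histograms into a single segment feature using the two strategies the paper mentions:

```python
import numpy as np

def pool_keyframe_features(keyframe_bows, strategy="average"):
    # keyframe_bows: (n_keyframes, dim) bag-of-words histograms for one
    # 5-second segment; with 1,000 codewords and the whole image plus the
    # 3x1 and 2x2 grids (8 regions in total), dim would be 8,000.
    stack = np.asarray(keyframe_bows)
    if strategy == "average":
        return stack.mean(axis=0)
    if strategy == "max":
        return stack.max(axis=0)
    raise ValueError("strategy must be 'average' or 'max'")

# Example: 5 keyframes per second over a 5-second segment gives 25 keyframes.
segment_feature = pool_keyframe_features(np.random.rand(25, 8000), strategy="max")
```

For the Fisher vector features, the reported sizes are consistent with the standard 2 x K x D dimensionality (gradients with respect to the GMM means and variances): 2 x 256 x 128 = 65,536 per trajectory descriptor, which would correspond to PCA-reduced 128-dimensional descriptors, and 2 x 256 x 39 = 19,968 for the 39-dimensional MFCC plus first and second derivatives without PCA. The PCA dimensionality is not stated in the paper, so 128 is only an inference from this arithmetic.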
3. CLASSIFICATION

LibSVM [4] is used for training and testing at the segment level. To generate training data, segments of which at least 80% is marked as violent in the ground truth are taken as positive examples; the remaining segments are considered negative. Extracted features are scaled to [0, 1] using the svm-scale tool of LibSVM. For still image features, we use a chi-square kernel to compute the distance matrix. For the audio and motion features, which are encoded as Fisher vectors, a linear kernel is used. The optimal gamma and cost parameters of the SVM classifiers are found by a grid search with 5-fold cross validation on the training dataset.

4. SUBMITTED RUNS

We use two training sets: (A) with 14 videos and (B) with 24 videos. We use the VSD 2013 test dataset (7 videos) as the validation set. We employ a simple late fusion strategy on the above features, using either equal weights or learnt weights. We submitted five runs in total (Figure 1): (R1) using training set A, we select the best still image feature and fuse it with the motion and audio features; (R2) using training set B, we fuse all still image features with the motion and audio features using equal weights; (R3) same as R1 but using training set B; (R4) using training set B, we fuse all still image features with the motion and audio features using fusion weights learnt on the validation set; (R5) using training set B, we fuse the motion and audio features with equal weights.

Figure 1: Overview of our system and the 5 submitted runs.
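The late fusion step can be summarized by the sketch below (Python; the modality names and the helper function are illustrative assumptions, not from the paper): each modality's SVM produces per-segment probability scores, which are combined as a weighted sum and then used to rank segments.

```python
import numpy as np

def late_fuse(scores_by_modality, weights=None):
    # scores_by_modality: dict mapping modality name -> (n_segments,) array
    # of SVM probability outputs. With weights=None all modalities receive
    # equal weight (as in runs R1-R3 and R5); R4 would instead pass weights
    # learnt on the validation set.
    names = sorted(scores_by_modality)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    return sum(weights[n] * np.asarray(scores_by_modality[n]) for n in names)

# Illustrative use for a run like R1 (best still image feature + motion + audio):
fused = late_fuse({"rgbsift": np.random.rand(100),
                   "motion": np.random.rand(100),
                   "audio": np.random.rand(100)})
ranking = np.argsort(-fused)  # segments ranked by fused violence score
```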
5. RESULTS AND DISCUSSIONS

The detailed performance of each submitted run is shown in Figure 3. Our best run is R1, which fuses the best single still image feature (RGB-SIFT) with the motion and audio features. There is not a big gap among the submitted runs. We observe that the performance of the motion features with Fisher vector encoding is always good and significantly better than that of the other features; in all submitted runs, we therefore used the motion features as the base to fuse with the others. The audio and still image features did not achieve good performance on their own, but they can be complementary to the motion features. Another interesting observation is that the runs trained on fewer videos (training set A, 14 videos) perform better than the runs trained on training set B (24 videos). This indicates that the second training set might contain ambiguous violent-scene annotations, which harms detection performance.

Figure 3: Results for the main task with the MAP2014 and MAP@100 (2013) metrics.

6. ACKNOWLEDGEMENTS

This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

7. REFERENCES

[1] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, and C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] K. van de Sande, T. Gevers, and C. Snoek. Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1582-1596, Sept. 2010.
[3] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV '13), pp. 3551-3558. IEEE Computer Society, Washington, DC, USA, 2013.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.