                               NII-UIT at MediaEval 2014
                         Violent Scenes Detection Affect Task

Vu Lam
University of Science, 227 Nguyen Van Cu, Dist. 5, Ho Chi Minh City, Vietnam
lqvu@fit.hcmus.edu.vn

Duy-Dinh Le
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
ledduy@nii.ac.jp

Sang Phan
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
plsang@nii.ac.jp

Shin’ichi Satoh
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
satoh@nii.ac.jp

Duc Anh Duong
University of Information Technology, KM20 Ha Noi Highway, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
ducda@uit.edu.vn

ABSTRACT
Violent scene detection (VSD) is a challenging problem because of heterogeneous content, large variations in video quality, and the semantic complexity of the target concepts. The Violent Scenes Detection Task of MediaEval [1] provides a common dataset and evaluation protocol and thus enables a fair comparison of methods. In this paper, we describe the VSD system we used in MediaEval 2014 and briefly discuss the performance results obtained in the main task. This year, we focus on improving the trajectory-based motion features that proved effective in the previous year's evaluation. Besides that, we also adopt the SIFT-based and audio features of last year's system. We combine these features using late fusion. Our results show that the trajectory-based motion features still give very competitive performance and that combining them with still-image and audio features can improve overall performance.

1.   INTRODUCTION
We consider the Violent Scenes Detection (VSD) task [1] as a concept detection task. For evaluation, we use our NII-KAORI-SECODE framework, which has achieved good performance on other benchmarks such as ImageCLEF and PASCAL VOC. First, videos are divided into equal segments of 5-second length. In each segment, keyframes are extracted by sampling 5 keyframes per second. For still-image features, local descriptors are extracted and encoded for all keyframes in each segment, and segment-based features are then formed from the keyframe-based features by average or max pooling. Motion and audio features are extracted directly from the whole segment. For all features, we use the popular SVM algorithm for learning. Finally, the probability output scores of the learned classifiers are used to rank the retrieved segments.

2.   FEATURE EXTRACTION
We use features from different modalities to test whether they are complementary for violent scenes detection. Currently, our VSD system incorporates still-image, motion, and audio features.

2.1   Still Image Features
This year, we use only SIFT-based features for the still-image modality because the different SIFT variants capture different characteristics of images. We extract the popular SIFT-based features with both Hessian-Laplace interest points and dense sampling at multiple scales. Besides the standard SIFT descriptor, we also use Opponent-SIFT and Color-SIFT [2]. We employ the bag-of-words model with a codebook size of 1,000 and soft assignment to generate a fixed-dimension representation for each keyframe. Besides encoding the whole image, we also divide it into 3x1 and 2x2 grids to encode spatial information. Finally, to generate a single representation for each segment, we use two pooling strategies: average pooling and max pooling.

2.2   Motion Feature
We use Improved Trajectories [3] to extract dense trajectories. A combination of Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histogram (MBH) is used to describe each trajectory. We encode the HOGHOF and MBH descriptors separately using Fisher vector encoding, with a codebook of size 256 trained as a Gaussian Mixture Model (GMM). After applying PCA, the representation of each descriptor has 65,536 dimensions. Finally, these two encodings are concatenated to form the final feature vector with 131,072 dimensions.
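To make the segment-level pooling described in Section 2.1 concrete, the following is a minimal Python sketch (not our actual implementation) of how keyframe-level bag-of-words histograms could be pooled into a single segment representation; the array shapes (25 keyframes per 5-second segment, 8 spatial regions of 1,000 visual words each) are illustrative assumptions.

    import numpy as np

    def pool_segment(keyframe_histograms, strategy="max"):
        # Pool keyframe-level BoW histograms (n_keyframes x dim) into one
        # segment-level vector by average or max pooling (Section 2.1).
        h = np.asarray(keyframe_histograms)
        return h.mean(axis=0) if strategy == "average" else h.max(axis=0)

    # Illustrative numbers only: a 5-second segment sampled at 5 keyframes/s gives
    # about 25 keyframes; each keyframe has a 1,000-word soft-assigned histogram
    # per spatial region (1x1 + 3x1 + 2x2 = 8 regions).
    keyframes = np.random.rand(25, 8 * 1000)
    avg_repr = pool_segment(keyframes, "average")
    max_repr = pool_segment(keyframes, "max")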
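The trajectory descriptors above and the audio features in Section 2.3 below are both encoded with Fisher vectors, whose size follows the standard formula 2·K·D (gradients with respect to the means and variances of K Gaussians over D-dimensional descriptors). The short sanity check below reproduces the reported dimensionalities; the PCA output size of 128 for the trajectory descriptors is our assumption, inferred from the reported 65,536 dimensions rather than stated explicitly.

    def fisher_vector_dim(num_gaussians, descriptor_dim):
        # Standard Fisher vector size: mean and variance gradients per Gaussian.
        return 2 * num_gaussians * descriptor_dim

    K = 256                               # GMM codebook size (Sections 2.2 and 2.3)
    print(fisher_vector_dim(K, 128))      # 65,536 per trajectory descriptor,
                                          # assuming PCA reduces HOGHOF/MBH to 128 dims
    print(2 * fisher_vector_dim(K, 128))  # 131,072 after concatenating HOGHOF and MBH
    print(fisher_vector_dim(K, 39))       # 19,968 for 13 MFCCs + first and second derivatives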
2.3   Audio Feature
We use the popular Mel-frequency Cepstral Coefficients (MFCC) to extract audio features, with a window length of 25 ms and a step size of 10 ms. The 13-dimensional MFCC vectors, together with their first and second derivatives, are used to represent each audio frame. The raw MFCC features are also encoded using Fisher vectors; the codebook is trained as a GMM with 256 clusters, and we do not apply PCA to the audio features. The final feature descriptor has 19,968 dimensions. Our motion and audio extraction framework is shown in Figure 2.

Figure 2: Our framework for extracting and encoding the motion and audio features.

3.   CLASSIFICATION
LibSVM [4] is used for training and testing at the segment level. To generate training data, segments of which at least 80% is marked as violent in the ground truth are taken as positive samples; the remaining segments are considered negative. Extracted features are scaled to [0, 1] using the svm-scale tool of LibSVM. For still-image features, we use a chi-square kernel to compute the kernel matrix. For the audio and motion features, which are encoded as Fisher vectors, a linear kernel is used. The optimal gamma and cost parameters of the SVM classifiers are found by a grid search with 5-fold cross-validation on the training set.


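A minimal sketch of the classification step above, using scikit-learn's SVC in place of the LibSVM command-line tools; the feature matrices, labels, and parameter grid are hypothetical, and the chi-square kernel is supplied to the SVM as a precomputed kernel matrix.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics.pairwise import chi2_kernel

    # Hypothetical training data: one row per 5-second segment, labels obtained
    # with the 80%-overlap rule described above (dimensions reduced for brevity).
    X_bow = np.random.rand(200, 8000)      # still-image bag-of-words features
    X_fv = np.random.randn(200, 1024)      # motion/audio Fisher vectors
    y = np.random.randint(0, 2, 200)

    # Scale features to [0, 1] (the paper uses LibSVM's svm-scale for this step).
    X_bow = MinMaxScaler().fit_transform(X_bow)

    # Still-image features: chi-square kernel, passed to SVC as a precomputed kernel.
    K_bow = chi2_kernel(X_bow)
    svm_bow = GridSearchCV(SVC(kernel="precomputed", probability=True),
                           {"C": [0.1, 1, 10, 100]}, cv=5).fit(K_bow, y)

    # Fisher vector features: linear kernel, same grid search over the cost parameter.
    svm_fv = GridSearchCV(SVC(kernel="linear", probability=True),
                          {"C": [0.1, 1, 10, 100]}, cv=5).fit(X_fv, y)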
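Section 4 below describes the submitted runs, which combine the per-modality probability scores by weighted late fusion with either equal weights or weights learnt on the validation set. A minimal sketch of that fusion step, with hypothetical score arrays and weights:

    import numpy as np

    def late_fusion(score_lists, weights=None):
        # Weighted average of per-modality probability scores for each segment.
        scores = np.vstack(score_lists)        # shape: (n_modalities, n_segments)
        if weights is None:                    # equal-weight fusion
            weights = np.ones(len(score_lists))
        weights = np.asarray(weights, dtype=float)
        return weights @ scores / weights.sum()

    # Hypothetical per-segment probabilities from the still-image, motion, and audio SVMs.
    sift_scores = np.random.rand(100)
    motion_scores = np.random.rand(100)
    audio_scores = np.random.rand(100)
    fused_equal = late_fusion([sift_scores, motion_scores, audio_scores])
    fused_learnt = late_fusion([sift_scores, motion_scores, audio_scores],
                               weights=[0.2, 0.5, 0.3])  # weights learnt on a validation set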
4.   SUBMITTED RUNS
We use two training sets: (A) with 14 videos and (B) with 24 videos. We use the VSD 2013 test dataset (7 videos) as the validation set. We employ a simple late fusion strategy on the above features, using either equal weights or weights learnt on the validation set. We submitted five runs in total (Figure 1): (R1) using training set A, we first select the best still-image feature and fuse it with the motion and audio features; (R2) using training set B, we fuse all still-image features with the motion and audio features using equal weights; (R3) same as R1 but using training set B; (R4) using training set B, we fuse all still-image features with the motion and audio features using fusion weights learnt on the validation set; (R5) using training set B, we fuse the motion and audio features with equal weights.

Figure 1: Overview of our system and the 5 submitted runs.

5.   RESULTS AND DISCUSSIONS
The detailed performance of each submitted run is shown in Figure 3. Our best run is the fusion of the best single still-image feature (RGBSIFT) with the motion and audio features (R1). There is no big gap among the submitted runs. We observe that the motion features with Fisher vector encoding always perform well and significantly better than the other features; in all submitted runs, we therefore used the motion features as the base to fuse with the others. The audio and still-image features did not achieve good performance on their own, but they can be complementary to the motion features. Another interesting observation is that the runs trained on fewer videos (training set A, 14 videos) performed better than the runs trained on set B (24 videos). This indicates that the second training set might contain ambiguous violent-scene annotations, which harm the detection performance.

Figure 3: Results for the main task with the MAP2014 and MAP@100 (2013) metrics.

6.   ACKNOWLEDGEMENTS
This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

7.   REFERENCES
[1] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, and C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.
[2] K. van de Sande, T. Gevers, and C. Snoek. Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582-1596, Sept. 2010.
[3] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV '13), pages 3551-3558, Washington, DC, USA, 2013.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain