NII-UIT at MediaEval 2013 Violent Scenes Detection Affect Task

Vu Lam (University of Science, 227 Nguyen Van Cu, Dist. 5, Ho Chi Minh City, Vietnam) lqvu@fit.hcmus.edu.vn
Duy-Dinh Le (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan) ledduy@nii.ac.jp
Sang Phan (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan) plsang@nii.ac.jp
Shin'ichi Satoh (National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan) satoh@nii.ac.jp
Duc Anh Duong (University of Information Technology, KM20 Ha Noi Highway, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam) ducda@uit.edu.vn

ABSTRACT
We present a comprehensive evaluation of shot-based visual and audio features for the MediaEval 2013 Violent Scenes Detection Affect Task. For visual features we use global features, local SIFT features, and motion features; for audio we employ the popular MFCC. We also evaluate the performance of mid-level features constructed from visual concepts. These features are combined by late fusion, and the results obtained by our submitted runs are presented.

Keywords
semantic concept detection, global feature, local feature, motion feature, audio feature, mid-level feature, late fusion

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

1. INTRODUCTION
We have developed NII-KAORI-SECODE, a general framework for semantic concept detection, and have used it to participate in several benchmarks such as IMAGECLEF, MEDIAEVAL, PASCAL-VOC, IMAGE-NET and TRECVID. This year we evaluate performance on a concept-detection-like task using shot-based feature representations only. Our previous work shows that shot-based features not only reduce the computational cost but also improve performance.

We treat the Violent Scenes Detection (VSD) Task [1] as a concept detection task and use the NII-KAORI-SECODE framework for evaluation. First, keyframes are extracted by sampling 5 keyframes per second. Raw features are extracted for all keyframes in each shot, and shot-based features are then formed from the keyframe-based features by average or max pooling. The motion feature and the audio feature are extracted directly from the whole shot. For the mid-level feature, we first build attribute classifiers for 7 visual attributes: fights, blood, gore, fire, car chase, cold arms, and firearms; the output scores of these attribute classifiers are then concatenated to form the mid-level feature representation. For all features, we use the popular SVM algorithm for learning. Finally, the probability output scores of the learned classifiers are used to rank the retrieved shots.

We use the same framework for both the objective and the subjective task (only the annotations differ). Our results show that the combined run using all visual, audio, and mid-level features achieves the best performance.

2. LOW-LEVEL FEATURES
We use features from different modalities to test whether they are complementary for violent scenes detection. Our current VSD system incorporates still-image features, a motion feature, and an audio feature.

2.1 Still Image Features
We use both global and local features for VSD because they capture different characteristics of images. For global features, we use Color Histogram (CH), Color Moments (CM), Edge Orientation Histogram (EOH), and Local Binary Patterns (LBP). For local features, we use the popular SIFT descriptor with both Hessian-Laplace interest points and dense sampling at multiple scales. For dense sampling, besides the standard SIFT descriptor, we also use Opponent-SIFT and C-SIFT; for the interest-point detector, we use only the standard SIFT descriptor. We employ the bag-of-words model with a codebook of size 1,000 and soft assignment to generate a fixed-dimension representation for each keyframe. Besides encoding the whole image, we also divide it into 3x1 and 2x2 grids to encode spatial information. Finally, to obtain a single representation per shot, we apply two pooling strategies: average pooling and max pooling.

2.2 Motion Feature
Trajectories are obtained by tracking densely sampled points in the optical flow fields, and each trajectory is described with the Motion Boundary Histogram (MBH), a descriptor known to perform well under camera motion. The motion feature is encoded with Fisher vectors after dimensionality reduction by PCA. The codebook has 256 components and is trained with a Gaussian Mixture Model (GMM); the final feature dimension is 65,536.

2.3 Audio Feature
We use the popular MFCC to extract the audio feature, with a segment length of 25 ms and a step size of 10 ms. Each audio segment is represented by 13-dimensional MFCCs together with their first and second derivatives. The raw MFCC features are also encoded with Fisher vectors, using a GMM codebook with 256 clusters and PCA for dimensionality reduction, resulting in 12,288-dimensional feature descriptors.
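To make the audio front end concrete, the sketch below extracts 25 ms / 10 ms MFCCs with first- and second-order derivatives, as described above. The paper does not name a toolkit; librosa and the helper shot_mfcc_descriptors are our own illustrative choices, not part of the original system.

```python
# Sketch of the MFCC front end in Section 2.3: 25 ms analysis windows,
# 10 ms step size, 13 MFCCs plus first and second derivatives.
# librosa is an assumed tool choice.
import librosa
import numpy as np

def shot_mfcc_descriptors(wav_path):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    n_fft = int(0.025 * sr)        # 25 ms window
    hop = int(0.010 * sr)          # 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)   # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)   # second derivative
    # One 39-d descriptor per 10 ms frame, shape (n_frames, 39)
    return np.vstack([mfcc, d1, d2]).T
```

The resulting 39-dimensional frame descriptors are then aggregated over each shot by the Fisher vector encoding sketched next.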
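Both the MBH and MFCC descriptors are PCA-reduced and Fisher-vector encoded with a 256-component GMM codebook. Below is a minimal sketch of such an encoding (first- and second-order statistics of a diagonal-covariance GMM); the power and L2 normalisation steps and the scikit-learn implementation are assumptions on our part, not details given in the paper.

```python
# Minimal Fisher vector encoding sketch for Sections 2.2/2.3:
# PCA reduction, a 256-component diagonal GMM codebook, then
# first/second-order statistics per component. Normalisation is assumed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

K = 256  # codebook size stated in the paper

def fit_codebook(train_descriptors, pca_dim):
    pca = PCA(n_components=pca_dim).fit(train_descriptors)
    gmm = GaussianMixture(n_components=K, covariance_type='diag',
                          random_state=0).fit(pca.transform(train_descriptors))
    return pca, gmm

def fisher_vector(descriptors, pca, gmm):
    x = pca.transform(descriptors)                    # (N, D)
    n = x.shape[0]
    q = gmm.predict_proba(x)                          # posteriors, (N, K)
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])     # 2 * K * D dimensions
    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)          # L2 normalisation
```

Given the reported output sizes (65,536 = 2 x 256 x 128 for MBH and 12,288 = 2 x 256 x 24 for MFCC), the PCA dimensions appear to be 128 and 24 respectively, although the paper does not state them explicitly.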
3. MID-LEVEL FEATURE
Besides low-level features, we also investigate using related violence information as a mid-level feature for detecting violent scenes. We use seven violence-related concepts as attributes: fire, firearms, cold arms, car chase, gore, blood, and fight. Low-level image features are used to train the attribute classifiers on the 2011 VSD development set. For each image, we apply these attribute classifiers to obtain a score for each attribute, and we concatenate these scores to form the mid-level representation of the image. We then train our mid-level classifier on the 2012 VSD development and test sets, and this classifier is used for testing on this year's set. The detailed workflow is shown in Figure 1.

[Figure 1: Mid-level feature construction]

4. CLASSIFICATION
LibSVM is used for training and testing at the shot level (based on the shot boundaries provided by the organizers). To generate training data, shots that overlap annotated positive segments by more than 80% are treated as positive; the remaining shots are treated as negative. Extracted features are scaled to [0, 1] using the svm-scale tool of LibSVM. For still-image features, we use a chi-square kernel to compute the kernel matrix; for the audio and motion features, which are Fisher-vector encoded, a linear kernel is used. The optimal gamma and cost parameters of the SVM classifiers are found by grid search with 5-fold cross-validation on the training set.
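As an illustration of this procedure, the sketch below trains a chi-square-kernel SVM for one still-image feature with a 5-fold grid search over gamma and cost. scikit-learn's chi2_kernel and SVC stand in for the LibSVM command-line tools, the parameter grid is illustrative, and average precision is used as the model-selection score; none of these specifics are stated in the paper.

```python
# Sketch of the Section 4 training loop for one still-image feature:
# [0,1]-scaled shot features, an exponential chi-square kernel, and a
# 5-fold grid search over gamma and C. scikit-learn replaces LibSVM here.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score

def grid_search_chi2_svm(X, y, gammas=(0.1, 1.0, 10.0), costs=(1.0, 10.0, 100.0)):
    """X: shot-level features scaled to [0, 1]; y: 0/1 violence labels."""
    best = (None, None, -1.0)
    for gamma in gammas:
        K = chi2_kernel(X, gamma=gamma)      # exp(-gamma * chi-square distance)
        for C in costs:
            fold_scores = []
            for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
                clf = SVC(C=C, kernel='precomputed', probability=True)
                clf.fit(K[np.ix_(tr, tr)], y[tr])
                p = clf.predict_proba(K[np.ix_(va, tr)])[:, 1]
                fold_scores.append(average_precision_score(y[va], p))
            if np.mean(fold_scores) > best[2]:
                best = (gamma, C, float(np.mean(fold_scores)))
    return best  # (gamma, C, mean cross-validated average precision)
```

The probability outputs of the selected classifier are then used to rank the test shots, as described in the introduction.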
5. SUBMITTED RUNS
We employ a simple late fusion strategy over the low-level and mid-level features described above, giving equal weights to the individual scores. We submitted five runs in total: (R5) fusion of all 4 global features and 5 local features; (R4) fusion of the motion feature (dense trajectories + MBH) and the audio feature (MFCC); (R3) the run using the mid-level feature; (R2) fusion of R4 and R5; and (R1) fusion of R3, R4 and R5.

6. RESULTS AND DISCUSSIONS
The detailed performance of each submitted run is shown in Figure 2. We report results for both the objective and the subjective task, using two evaluation metrics per task: overall MAP and MAP100 (the MAP at the top 100 returned shots). Our best run is R1, which fuses the global, local, motion, audio, and mid-level features; this confirms the benefit of combining multiple features for violent scenes detection. Among all submitted runs, the mid-level run (R3) performs worst; however, it is complementary when combined with the low-level features in R1. The combined motion and audio run (R4) did not achieve the expected results; in fact, its performance is lower than that of the combined still-image run (R5). This may be due to limited motion within individual shots and/or noise in the audio signals.

[Figure 2: Results of our submitted runs]

Our future work includes investigating the contributions of the motion and audio features further. The mid-level feature result is also promising: we currently use only 7 visual concepts to construct the mid-level feature, and in the future we will also incorporate audio concepts built on the audio feature.

7. ACKNOWLEDGEMENTS
This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

8. REFERENCES
[1] C.-H. Demarty, C. Penet, M. Schedl, B. Ionescu, Q. V. Lam, and Y.-G. Jiang. The MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.