TUDCL at MediaEval 2013 Violent Scenes Detection: Training with Multi-modal Features by MKL

Shinichi Goto (1), Terumasa Aoki (1, 2)
(1) Graduate School of Information Sciences, (2) New Industry Creation Hatchery Center
Tohoku University, Miyagi, Japan
{s-goto, aoki}@riec.tohoku.ac.jp

ABSTRACT

The purpose of this paper is to describe the work carried out by team TUDCL for the Violent Scenes Detection task at MediaEval 2013. Our work is based on the combination of visual, temporal and audio features with machine learning at segment level. A block-saliency-map based dense trajectory is proposed for the visual and temporal features, and MFCC and delta-MFCC are used for the audio features. For classification, Multiple Kernel Learning is applied, which is effective when multi-modal features are available.

1. INTRODUCTION

The MediaEval 2013 Affect Task [1] is intended to detect violent scenes in movies. Although two different definitions of violent events are provided this year, our algorithm is developed only to solve the task for the objective definition, which is "physical violence or accident resulting in human injury or pain."

2. APPROACH

Rather than focusing on video shots from the beginning, our approach first handles fixed-length segments, each of which has 20 frames (0.8 seconds at 25 fps). After segment-based scores are calculated from the extracted feature vectors by machine learning, shot-based scores are generated.

For our runs only the violent and non-violent ground truth is used; neither high-level concepts nor external data are used.

2.1 Visual and Temporal Features

Both visual and temporal features based on dense trajectories [2] are calculated at every frame. The original dense trajectory algorithm samples points densely in every frame except in homogeneous image areas; we additionally apply the saliency maps proposed by Itti et al. [3] to increase precision, on the assumption that events related to violence are located in the areas people tend to pay attention to.

In our algorithm, a normal saliency map is first generated and then transformed into a block-based map by averaging the saliency values within a fixed block area, so that dense sampling can be applied while varying the sampling step size and the maximum spatial scale level according to the saliency level. For instance, the most salient blocks in an image are densely sampled with the smallest step size, which guarantees that the more salient a block is, the more points are obtained there. Figure 1 shows an example of our dense sampling compared with normal dense sampling: our algorithm samples more points in salient regions and fewer points in non-salient regions, whereas normal dense sampling takes points more uniformly over the whole frame. Note that points in homogeneous areas have already been removed.

Figure 1: Example of dense sampling using a saliency map: original image (upper left), normal dense sampling (upper right), block saliency map (bottom left), our dense sampling (bottom right).
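To make the block-based sampling concrete, the following is a minimal sketch of the idea, assuming a per-pixel saliency map (e.g., from an Itti-style model) is already available. The block size, the three candidate step sizes and the quantile thresholds are illustrative choices rather than the exact parameters of our runs, and the homogeneous-area filtering and spatial-scale adaptation of the full method are omitted.

```python
import numpy as np

def block_saliency_map(saliency, block=16):
    """Average a per-pixel saliency map over fixed-size blocks."""
    h, w = saliency.shape
    bh, bw = h // block, w // block
    crop = saliency[:bh * block, :bw * block]
    return crop.reshape(bh, block, bw, block).mean(axis=(1, 3))

def sample_points(saliency, block=16, steps=(4, 8, 16)):
    """Dense sampling whose step size shrinks as block saliency grows,
    so that more points are taken in more salient blocks."""
    bmap = block_saliency_map(saliency, block)
    # split the block saliency values into as many levels as there are step sizes
    thresholds = np.quantile(bmap, [1.0 / 3, 2.0 / 3])
    levels = np.digitize(bmap, thresholds)      # 0 = least salient, 2 = most salient
    points = []
    for (by, bx), level in np.ndenumerate(levels):
        step = steps[len(steps) - 1 - level]    # most salient block -> smallest step
        for y in range(by * block, (by + 1) * block, step):
            for x in range(bx * block, (bx + 1) * block, step):
                points.append((x, y))           # candidate point for a new trajectory
    return points

# illustrative usage with a random stand-in for a saliency map
rng = np.random.default_rng(0)
pts = sample_points(rng.random((240, 320)))
```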
Trajectories, MBH and, additionally, RGB histograms around the trajectories are extracted as visual and temporal information. HOG and HOF, which are also proposed in [2], are not used, because those features contributed poorly in our test runs.

All features are converted to Bag-of-Words form in each segment, giving a 200-d trajectory histogram, a 200-d MBH-x histogram, a 200-d MBH-y histogram and a 400-d RGB histogram. In total, a 1000-d feature vector is used as the visual and temporal feature for classification.

2.2 Audio Features

MFCC, delta-MFCC and audio energy are calculated every 20 ms with 10 ms overlap to create a 200-d Bag-of-Audio-Words vector for each 0.8-second segment.

2.3 Classifier Learning

Although a conventional way of tackling this classification problem is to use a Support Vector Machine (SVM), we apply Multiple Kernel Learning (MKL), which aims at finding optimized weights when multiple SVM kernels are combined [4]. This suits our case well, since multiple feature spaces exist. The whole kernel is composed of multiple base kernels and is computed according to the following equation:

    K(x_i, x_j) = \sum_k d_k K_k(x_i, x_j)    (1)

where the K_k are base kernels and d_k is the weight of each kernel. In our case, kernels for the trajectory, x-direction MBH, y-direction MBH, RGB-histogram and audio features are prepared. As the kernel function, the Histogram Intersection Kernel (HIK) is used, since all of our features are histogram-based.
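As a concrete illustration of Eq. (1) with HIK base kernels, the sketch below builds the combined Gram matrix from per-modality Bag-of-Words histograms. The feature dimensions follow Sections 2.1 and 2.2, but the data and the weights are placeholders; learning the weights d_k themselves requires an MKL solver, which is not shown here.

```python
import numpy as np

def hik(X, Y):
    """Histogram Intersection Kernel: K[i, j] = sum_b min(X[i, b], Y[j, b])."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

def combined_kernel(features, weights):
    """Eq. (1): K = sum_k d_k * K_k, with one HIK base kernel per modality."""
    return sum(d * hik(X, X) for d, X in zip(weights, features))

# illustrative usage with random Bag-of-Words histograms for the five modalities
rng = np.random.default_rng(0)
n = 50                                    # number of training segments
dims = (200, 200, 200, 400, 200)          # trajectory, MBH-x, MBH-y, RGB, audio
feats = [rng.random((n, d)) for d in dims]
weights = [0.32, 0.36, 0.37, 0.35, 0.31]  # placeholder values; MKL learns these (cf. Table 1)
K = combined_kernel(feats, weights)       # (n, n) precomputed Gram matrix
```

Such a precomputed Gram matrix can then be passed to any SVM implementation that accepts precomputed kernels (for instance, scikit-learn's SVC(kernel='precomputed')).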
Although MKL can find optimal weights, we found that these values differ from movie to movie. Table 1 shows the difference between the weights learned from three different training movies. Therefore, a classifier is first learned separately for each training movie to give a binary classification for each segment, and these classifiers are finally integrated in the way described below.

Table 1: Weights learned by MKL for three different training movies.

    Movie               Audio   Traj.   MBHx    MBHy    RGB
    Armageddon          0.307   0.319   0.359   0.373   0.350
    The Sixth Sense     0.450   0.180   0.407   0.440   0.171
    Dead Poets Society  0.297   0.267   0.425   0.462   0.286

2.4 Integration

The first step here is to calculate a pre-final violence score for each segment. To do so, for each segment in a test movie we simply count the number of classifiers that classify that segment as violent. The score s_i for the i-th segment is therefore:

    s_i = \sum_{m=0}^{M-1} c_i(m),    c_i(m) \in \{0, 1\}    (2)

where c_i(m) is the result of the binary classification by the m-th classifier, with 0 for non-violence and 1 for violence, and M is the total number of classifiers, which equals the number of training movies.

Finally, a moving average is calculated as a smoothing step for each test movie in order to obtain the final score s'_i of every segment:

    s'_i = ( s_i + \sum_{n=1}^{N} \alpha^n (s_{i-n} + s_{i+n}) ) / (2N + 1),    0 < \alpha < 1    (3)

where \alpha is a smoothing coefficient and N is the neighborhood range around a segment. We used \alpha = 0.5 and N = 2.

This integration process is needed in order to take the continuity of segments into account. Besides, since each classifier learns only one training movie, the violence concepts that this movie does not contain can easily be missed by it; combining all classifiers mitigates this. Scores for shots are obtained by converting the segment-based scores after computing a score per frame. If the score is higher than a threshold, the segment or shot is classified as violent. We chose 0.1 as the segment threshold, and 0.03 and 0.06 as the shot thresholds.

3. RESULTS AND DISCUSSION

Shot-based results of our runs are shown in Table 2, and segment-based results are shown in Table 3. The only difference between mkl-shot-hik-1 and mkl-shot-hik-2 is the value of the scoring threshold (0.03 for the former, 0.06 for the latter), which therefore does not affect MAP@100. In addition to our main runs, results obtained with a normal SVM with an RBF kernel are shown for comparison; no MAP@100 score is given for the SVM runs because they only produce binary classification decisions and no score.

Table 2: Results of shot-level runs (all metrics are AED metrics).

    Run             MAP@100   Prec.    Rec.    F-sc.
    mkl-shot-hik-1  0.470     0.222    0.726   0.340
    mkl-shot-hik-2  0.470     0.284    0.609   0.387
    svm-shot-rbf    -         0.0976   0.738   0.172

Table 3: Results of segment-level runs.

    Run          MAP@100   Prec.    Rec.    F-sc.
    mkl-seg-hik  0.343     0.214    0.309   0.253
    svm-seg-rbf  -         0.0473   0.466   0.0859

Our results show that the approach of Multiple Kernel Learning with the HIK kernel is effective for violent scenes detection, though its F-score is still not high enough. We investigated this and came to the presumption that segments with frequent camera motion, multiple people and loud sound tend to be misclassified as violent. On the other hand, commonly missed violent segments are violent scenes without sound, such as a scene in which a man is wringing another man's neck. It is reasonable to suppose that segments in which the multi-modality cannot be exploited are likely to be missed.

Although MBH, which was proposed as being robust to camera motion, is extracted, the trajectories themselves are easily affected by camera motion, which makes them unreliable. Some countermeasure against this problem is therefore imperative.

It should also be added that, since each classifier has learned a single training movie, the amount of training feature vectors per classifier might not be sufficient compared with the case in which one classifier learns all movies simultaneously. Since not enough comparisons with other methods have been made yet, we will continue our investigation.

4. REFERENCES

[1] C. Demarty, C. Penet, M. Schedl, B. Ionescu, V. L. Quang, and Y. Jiang. The MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.

[2] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3169-3176, Colorado Springs, United States, June 2011.

[3] L. Itti, C. Koch, and E. Niebur. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, November 1998.

[4] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research, 5:27-72, 2004.