=Paper=
{{Paper
|id=None
|storemode=property
|title=The Vireo Team at MediaEval 2013: Violent Scenes Detection by Mid-level Concepts Learnt from Youtube
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_12.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/TanN13
}}
==The Vireo Team at MediaEval 2013: Violent Scenes Detection by Mid-level Concepts Learnt from Youtube==
<pdf width="1500px">https://ceur-ws.org/Vol-1043/mediaeval2013_submission_12.pdf</pdf>
<pre>
       The Vireo Team at MediaEval 2013: Violent Scenes
      Detection by Mid-level Concepts Learnt from Youtube

                         Chun Chet Tan                                             Chong-Wah Ngo
                Department of Computer Science                          Department of Computer Science
            City University of Hong Kong, Hong Kong                 City University of Hong Kong, Hong Kong
                   cctan2-c@my.city.edu.hk                                  cscwngo@cityu.edu.hk

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT
The Violent Scenes Detection task continues to pose challenges in detecting violent scenes in
Hollywood movies. In this working notes paper, we present the framework of our system and
briefly discuss the results obtained in both the objective and subjective subtasks. Besides
using low-level features to train SVM classifiers for violent scenes detection, we show the
feasibility of using concept detectors to infer the occurrence of violent scenes. External
Youtube data is exploited in our implementation to provide a more diverse definition of the
violent scene concepts. Furthermore, we explore the feasibility of using Conditional Random
Fields (CRF) to refine the concept detection of movie shots holistically, given the
relationships extracted from ConceptNet and the co-occurrence information defined by the
normalized Google distance (NGD). We demonstrate solid improvements in performance by using
mid-level concept based detectors and CRF refinement in both the objective and subjective
subtasks.

1. INTRODUCTION
  This year, we explored several interesting possibilities for detecting violent scenes in
movies. Besides using low-level features, we use violence concept detectors to infer the
occurrence of violent scenes. In addition, Conditional Random Fields (CRF) are used as a
refinement step to improve the overall violence concept detection.

2. SYSTEM DESCRIPTION
  Figure 1 shows an overview of our system framework. A diverse set of audio-visual features
is extracted for training the χ² SVM classifiers for violent scenes detection. These low-level
features include:
  Dense Trajectories: The features are extracted using the method of [5]. Each trajectory is
described by three features, namely histograms of oriented gradients (HOG), histograms of
optical flow (HOF) and motion boundary histograms (MBH). Including the trajectory shape
features, we have 4 features in total. Each of these features encodes complementary
information in the videos. HOG encodes the local appearance information, while the local
motion patterns are captured by HOF and MBH.
  SIFT: Two sparse keypoint detectors, Difference of Gaussian and Hessian Affine, are adopted
to locate locally invariant image patches in video frames. This feature is then represented
using the popular BoW framework, with two separate 500-d codebooks. Three spatial layers
(1 × 1, 3 × 1 and 2 × 2) are used in the vector quantization process, producing an
8,000-dimensional feature vector (2 detectors × 500 visual words × 8 spatial regions) by
concatenating the features from both detectors.
  Audio Features: The MFCC features are densely extracted from the audio track of the videos.
However, we found that MFCC is not sensitive to some audio-dominant concepts, e.g. explosions
and gunshots. This has inspired us to investigate other audio features. Due to the length
limit, we do not report the performance of each audio feature individually. The best result
is obtained with the combination of line spectral frequency (LSF), octave band signal
intensity (OBSI), linear predictor coefficients (LPC), MFCC and their first and second
derivatives.

[Figure 1: Framework of our system for violent scenes detection. Diagram labels: video shots;
feature extraction (dense trajectories, SIFT, audio features); SVM classifiers; concept
detectors; CRF refinement; prediction scores.]

  We train the SVM classifiers using mid-level concept based features. These concept based
features are composed of the prediction outputs of the violence concept detectors. The
detectors are trained using the aforementioned low-level audio-visual features. Kernel-level
early fusion (mean of features) is used to fuse all these low-level features. Ten violence
concepts are provided by MediaEval [2], such as “fights”, “explosions”, “gun shots”, etc. We
use these 10 violence concepts to infer 42 additional violence concepts from ConceptNet [4],
and we train detectors for these additional concepts using Youtube video clips, which are
crawled using keywords and tags without human inspection. The motivation behind this is to
build an event network with more diverse violence concepts. The violence-related concepts are
listed in Table 1.
  We detect the occurrence of these violence concepts in the video shots and use the
detection scores as the features for the SVM classifiers. Since we have collected 52 violence
concepts from ConceptNet, a graphical model can be generated to represent the violence
concepts and their relationships based on the ontology of ConceptNet. We incorporate the
co-occurrence information of these violence concepts into the CRF for detection refinement.
For example, the “gory scenes” concept normally co-occurs with the “blood” concept. Our
objective is to retain certain concepts and discard the others in the event network for a
particular video clip. A pairwise energy function, which makes use of the detection output
and also incorporates the co-occurrence statistics, is proposed as follows:

    E(X) = \sum_{v_i \in V} \psi_{v_i}(x_i) + \sum_{(v_i, v_j) \in N} \delta_{v_i v_j}(x_i, x_j)    (1)

where the unary potential \psi_{v_i} is defined over the retention of concepts in the graph,
based upon the classifier responses, i.e. the detection scores of the SVMs. The pairwise
potential \delta_{v_i v_j} is defined over the co-occurrence of concepts, for which the
normalized Google distance (NGD) [1] is adopted. Graph cut is used to minimize the energy
function. The refinement process is carried out before the concept scores are fed into the
SVM for violent scenes detection. Figure 2 shows an example of the CRF refinement. The
originally detected concepts include “action”, “explosion”, “fight”, “fire”, “gunshot” and
“scream”. After CRF refinement, only “explosion” and “fire” are retained. The unary
potentials (the detection scores) of the discarded concepts, although above the thresholds,
are outweighed by the pairwise potentials (the co-occurrence information) in the energy based
model, where the detection is considered holistically.

[Figure 2: CRF refinement example shown in a partial event network. The retained and
discarded concepts are circled in green and red respectively.]

   As we found last year that score smoothing [3] was very useful in improving the results,
it is adopted for all the final prediction scores. The prediction scores are averaged over a
three-shot window along the timeline of each movie.

Table 1: 52 violence-related concepts inferred from ConceptNet, including the 10 violence
concepts (underlined) defined by MediaEval.

  accident      club         gun            person        slap
  action        cold arm     gunshot        pull          stab
  arm           explosion    hand           punch         stick
  beat          fall         harm           punishment    victim
  bleed         fight        help           push          violence
  blood         fire         hit            rape          war
  bomb          firearm      horror         roll          whip
  bone          foot         hurt           rope          woman
  break         force        kick           scream
  bullet        gang         machine gun    shock
  car chase     gore         murder         shoot

2.1 Submitted Runs
   As depicted in Table 2, we submitted four runs based on the aforementioned features,
namely the low-level features (the baseline), the mid-level concept based features, the
mid-level concept based features with CRF refinement, and the late fusion of all the runs.

Table 2: Performance of our system for violent scenes detection.

                      Objective Subtask       Subjective Subtask
                      mAP@20     mAP@100      mAP@20     mAP@100
  Low-level Feat.     0.6167     0.5909       0.7223     0.6942
  Concept             0.6294     0.5749       0.7688     0.7162
  Concept+CRF         0.6111     0.6063       0.8064     0.7306
  Late Fusion         0.6509     0.6195       0.7996     0.7429

3. RESULTS AND DISCUSSION
   Table 2 shows the performance¹ of our system. Each result is the mean of five repeated
sets of runs. It can be seen that detection using concept based features is superior overall
to detection using low-level features. In particular, a more significant improvement is shown
in the subjective subtask. The effect of CRF refinement can also be observed by comparing the
runs with CRF to their counterparts without CRF. Compared to the baseline, the CRF runs in
the subjective subtask show a solid relative improvement of 11.6% and 5.2% in mAP@20 and
mAP@100 respectively. The results indicate that using CRF to consider the concept detection
holistically, based on co-occurrence information, is effective. They also show that the
structure of the event network derived from ConceptNet is useful. Finally, the runs with late
fusion benefit from the advantages of the other runs and show the highest mAP in three out of
four evaluations.

¹ The results are obtained with amended thresholds, different from the official submissions.

4. ACKNOWLEDGMENTS
   The work described in this paper was fully sponsored by a grant from the National Natural
Science Foundation of China (61272290) and was fully supported by the Shenzhen Research
Institute, City University of Hong Kong.

5. REFERENCES
[1] R. L. Cilibrasi and P. M. B. Vitanyi. The Google similarity distance. IEEE Trans. on
    Knowl. and Data Eng., 19(3):370–383, Mar. 2007.
[2] C.-H. Demarty, C. Penet, M. Schedl, B. Ionescu, V. L. Quang, and Y.-G. Jiang. The
    MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, 2013.
[3] Y.-G. Jiang, Q. Dai, C. C. Tan, X. Xue, and C.-W. Ngo. The Shanghai-Hongkong Team at
    MediaEval 2012: Violent scene detection using trajectory-based features. In MediaEval
    2012 Workshop, 2012.
[4] H. Liu and P. Singh. ConceptNet – a practical commonsense reasoning tool-kit. BT
    Technology Journal, 22(4):211–226, Oct. 2004.
[5] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories.
    In IEEE Conference on Computer Vision & Pattern Recognition, pages 3169–3176, Colorado
    Springs, United States, June 2011.
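
The energy of Eq. (1) and the CRF refinement step of Section 2 can be illustrated with a
small sketch. The paper does not give the exact form of the potentials, so everything below
is an assumption made for illustration: the unary cost is taken as (1 - score) for a retained
concept and (score) for a discarded one, which implies a threshold of 0.5; the pairwise cost
is a fixed penalty paid when two linked concepts receive different labels; exhaustive search
stands in for graph cut; and the scores and edge weights are hypothetical. The ngd function
follows the definition in [1]; how NGD is mapped to an edge weight is not stated in the
paper, so the weights are supplied directly.

```python
import itertools
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google distance [1] from hit counts.

    fx, fy: pages containing each term; fxy: pages containing both terms;
    n: total number of indexed pages.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def energy(labels, scores, edges):
    """Evaluate E(X) of Eq. (1) for one labelling (1 = retain, 0 = discard)."""
    # Assumed unary cost: retaining a low-scoring concept or discarding a
    # high-scoring one is expensive.
    unary = sum((1.0 - s) if labels[c] else s for c, s in scores.items())
    # Assumed pairwise cost: linked concepts pay penalty w when their labels differ.
    pairwise = sum(w for (a, b), w in edges.items() if labels[a] != labels[b])
    return unary + pairwise


def refine(scores, edges):
    """Return the minimum-energy labelling by exhaustive search.

    Exhaustive search replaces graph cut here and is only practical for a
    handful of concepts.
    """
    concepts = sorted(scores)
    best = min(
        itertools.product((0, 1), repeat=len(concepts)),
        key=lambda bits: energy(dict(zip(concepts, bits)), scores, edges),
    )
    return dict(zip(concepts, best))


if __name__ == "__main__":
    # Hypothetical detection scores and co-occurrence weights.
    scores = {"explosion": 0.9, "fire": 0.8, "blood": 0.55, "gore": 0.2}
    edges = {
        ("explosion", "fire"): 0.7,
        ("blood", "gore"): 0.8,
        ("fire", "blood"): 0.05,
    }
    print(refine(scores, edges))
```

With these hypothetical values, "explosion" and "fire" are retained, while "blood" is
discarded even though its score of 0.55 is above 0.5: its strong link to the low-scoring
"gore" outweighs its unary term, which mirrors the behaviour described for Figure 2.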

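
The score smoothing described at the end of Section 2 (averaging prediction scores over a
three-shot window along the timeline of each movie) can be sketched as follows. The paper
does not say how the window is aligned or how the first and last shots are handled; this
sketch assumes a window centred on the current shot and truncated at the movie boundaries.
The function name and the input scores are hypothetical.

```python
def smooth_scores(scores, window=3):
    """Average each shot's score with its neighbours in temporal order.

    scores: prediction scores of consecutive shots from one movie.
    window: number of shots in the (assumed) centred window.
    """
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        # Truncate the window at the start and end of the movie.
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    # Hypothetical per-shot scores: an isolated spike is damped by its neighbours.
    print(smooth_scores([0.1, 0.1, 0.9, 0.1, 0.1]))
```

Smoothing should be applied per movie, so that scores from the last shots of one movie do
not leak into the first shots of the next.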
</pre>