=Paper=
{{Paper
|id=None
|storemode=property
|title=The Vireo Team at MediaEval 2013: Violent Scenes Detection by Mid-level Concepts Learnt from Youtube
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_12.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/TanN13
}}
==The Vireo Team at MediaEval 2013: Violent Scenes Detection by Mid-level Concepts Learnt from Youtube==
Chun Chet Tan, Chong-Wah Ngo

Department of Computer Science, City University of Hong Kong, Hong Kong

cctan2-c@my.city.edu.hk, cscwngo@cityu.edu.hk
Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

===ABSTRACT===
The Violent Scenes Detection task continues to pose a challenge in detecting violent scenes in Hollywood movies. In this working notes paper, we present the framework of our system and briefly discuss the performance results obtained in both the objective and subjective subtasks. Besides using low-level features to train the SVM classifiers for violent scenes detection, we show the feasibility of using concept detectors to infer the occurrence of violent scenes. External Youtube data is exploited in our implementation to provide a more diverse definition of violent scene concepts. Furthermore, we explore the feasibility of using Conditional Random Fields (CRF) to refine the concept detection of movie shots holistically, given the relationships extracted from ConceptNet and the co-occurrence information defined by normalized Google distance (NGD). We demonstrate solid improvements in performance by using mid-level concept based detectors and CRF refinement in both the objective and subjective subtasks.

===1. INTRODUCTION===
This year, we explored several interesting possibilities in detecting violent scenes in movies. Besides using the low-level features, we use violence concept detectors to infer the occurrence of violent scenes. In addition, Conditional Random Fields (CRF) are used as a refinement to improve the overall violence concept detection.

===2. SYSTEM DESCRIPTION===
Figure 1 shows the overview of our system framework. A diverse set of audio-visual features is extracted for training the χ² SVM classifiers for violent scenes detection. These low-level features include:

Figure 1: Framework of our system for violent scenes detection. (Diagram blocks: video shots; feature extraction with Dense Trajectories, SIFT and audio features; SVM classifiers; concept detectors; CRF refinement; prediction scores.)

Dense Trajectories: The features are extracted using the method of [5]. Each trajectory is described by three features, namely histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histogram (MBH). Including the trajectory shape features, we have 4 features in total. Each of these features encodes complementary information in the videos. HOG encodes the local appearance information while the local motion patterns are captured by HOF and MBH.

SIFT: Two sparse keypoint detectors, Difference of Gaussian and Hessian Affine, are adopted to locate locally invariant image patches in video frames. This feature is then represented using the popular BoW framework, using two separate 500-d codebooks. Three spatial layers (1 × 1, 3 × 1 and 2 × 2) are used in the vector quantization process, producing an 8,000-dimensional feature vector by concatenating the features from both detectors (500 visual words × 8 spatial cells × 2 detectors).

Audio Features: The MFCC features are densely extracted from the audio track of the videos. However, we found that MFCC is not sensitive to some audio-dominant concepts, e.g. explosions and gunshots. This inspired us to investigate other audio features. Due to the length limit, we do not report the performance of each individual audio feature. The best result is obtained with the combination of line spectral frequency (LSF), octave band signal intensity (OBSI), linear predictor coefficients (LPC), MFCC and their first and second derivatives.

We train the SVM classifiers using mid-level concept based features. These concept based features are composed of the prediction output of the violence concept detectors. The detectors are trained using the aforementioned low-level audio-visual features. Kernel-level early fusion (mean of features) is used to fuse all these low-level features. Ten violence concepts are provided by MediaEval [2], such as “fights”, “explosions” and “gun shots”. We use these 10 violence concepts to infer 42 extra violence concepts from ConceptNet [4], and we train detectors for these extra concepts using Youtube video clips, which are crawled using keywords and tags without human inspection. The motivation behind this is to build an event network with more diverse violence concepts. The violence-related concepts are depicted in Table 1.

We detect the occurrence of these violence concepts in the video shots and use the detection scores as the features to the SVM classifiers. Since we have collected 52 violence concepts from ConceptNet, a graphical model can be generated to represent the violence concepts and their relationships based on the ontology of ConceptNet. We incorporate the
co-occurrence information of these violence concepts into the CRF for detection refinement. For example, “gory scenes” normally co-occurs with the “blood” concept. Our objective is to retain certain concepts and discard the others in the event network for a particular video clip. A pairwise energy function that uses the detection output and incorporates the co-occurrence statistics is proposed as follows:

<math>E(X) = \sum_{v_i \in V} \psi_{v_i}(x_i) + \sum_{(v_i, v_j) \in N} \delta_{v_i v_j}(x_i, x_j) \qquad (1)</math>

where the unary potential <math>\psi_{v_i}</math> is defined over the retention of concepts in the graph, based upon the classifier responses, i.e. the detection scores of the SVMs. The pairwise potential <math>\delta_{v_i v_j}</math> is defined over the co-occurrence of concepts, for which normalized Google distance (NGD) [1] is adopted. Graph cut is used to minimize the energy function. The refinement process is carried out before the concept scores are fed into the SVM for violent scenes detection. Figure 2 shows an example of the CRF refinement. The originally detected concepts include “action”, “explosion”, “fight”, “fire”, “gunshot” and “scream”. After CRF refinement, only “explosion” and “fire” are retained. The unary potentials (the detection scores) of the discarded concepts, although beyond the thresholds, are outweighed by the pairwise potentials (the co-occurrence information) in the energy based model, where the detection is considered holistically.

Figure 2: CRF refinement example shown in a partial event network. The retained and discarded concepts are circled in green and red respectively.

{| class="wikitable"
|+ Table 1: 52 violence-related concepts inferred from ConceptNet, including the 10 violence concepts defined by MediaEval (underlined in the original; the underlining is not preserved here).
| accident || club || gun || person || slap
|-
| action || cold arm || gunshot || pull || stab
|-
| arm || explosion || hand || punch || stick
|-
| beat || fall || harm || punishment || victim
|-
| bleed || fight || help || push || violence
|-
| blood || fire || hit || rape || war
|-
| bomb || firearm || horror || roll || whip
|-
| bone || foot || hurt || rope || woman
|-
| break || force || kick || scream ||
|-
| bullet || gang || machine gun || shock ||
|-
| car chase || gore || murder || shoot ||
|}

As we found last year that score smoothing [3] was very useful in improving the results, it is adopted for all the final prediction scores. The prediction scores are averaged over a three-shot window along the timeline of each movie.

====2.1 Submitted Runs====
As depicted in Table 2, we submitted four runs based on the aforementioned features, namely the low-level features (the baseline), the mid-level concept based features, the mid-level concept based features with CRF refinement, and the late fusion of all the runs.

{| class="wikitable"
|+ Table 2: Performance of our system for violent scenes detection.
! Run !! Objective mAP@20 !! Objective mAP@100 !! Subjective mAP@20 !! Subjective mAP@100
|-
| Low-level Feat. || 0.6167 || 0.5909 || 0.7223 || 0.6942
|-
| Concept || 0.6294 || 0.5749 || 0.7688 || 0.7162
|-
| Concept+CRF || 0.6111 || 0.6063 || 0.8064 || 0.7306
|-
| Late Fusion || 0.6509 || 0.6195 || 0.7996 || 0.7429
|}

===3. RESULTS AND DISCUSSION===
Table 2 shows the performance<sup>1</sup> of our system. Each result is the mean of five repeated sets of runs. Detection using concept based features is superior to detection using low-level features overall (in three of the four measures, the exception being objective mAP@100), and the improvement is more pronounced in the subjective subtask. The effect of CRF refinement can be observed by comparing the runs with CRF against their counterparts without CRF. Compared to the baseline, the CRF run in the subjective subtask shows a solid relative improvement of 11.6% in mAP@20 (0.7223 to 0.8064) and 5.2% in mAP@100 (0.6942 to 0.7306). The results indicate that using CRF to consider the concept detection holistically with co-occurrence information is effective. They also show that the structure of the event network derived from ConceptNet is useful. Finally, the late fusion run benefits from the advantages of the other runs and shows the highest mAP in three out of four evaluations.

<sup>1</sup> The results are obtained with amended thresholds, different from the official submissions.

===4. ACKNOWLEDGMENTS===
The work described in this paper was fully sponsored by a grant from the National Natural Science Foundation of China (61272290) and was fully supported by the Shenzhen Research Institute, City University of Hong Kong.

===5. REFERENCES===
[1] R. L. Cilibrasi and P. M. B. Vitanyi. The Google similarity distance. IEEE Trans. on Knowl. and Data Eng., 19(3):370–383, Mar. 2007.

[2] C.-H. Demarty, C. Penet, M. Schedl, B. Ionescu, V. L. Quang, and Y.-G. Jiang. The MediaEval 2013 Affect Task: Violent Scenes Detection. In MediaEval 2013 Workshop, 2013.

[3] Y.-G. Jiang, Q. Dai, C. C. Tan, X. Xue, and C.-W. Ngo. The Shanghai-Hongkong team at MediaEval 2012: Violent scene detection using trajectory-based features. In MediaEval 2012 Workshop, 2012.

[4] H. Liu and P. Singh. ConceptNet – a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, Oct. 2004.

[5] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3169–3176, Colorado Springs, United States, June 2011.
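As an illustration of the CRF refinement, the following minimal Python sketch evaluates an energy of the form of Eq. (1) and picks the labelling with the lowest energy. It is a sketch under stated assumptions, not the system described above. The paper does not give the concrete potentials, so the unary cost (one minus the detection score for a retained concept, the score itself for a discarded one) and the pairwise cost (a penalty of `lam * exp(-NGD)` when two linked concepts receive different labels) are assumptions. The names `energy`, `refine` and `lam` are hypothetical. The paper minimizes the energy with graph cut; the sketch swaps in exhaustive search, which is only feasible for a handful of concepts.

```python
from itertools import product
from math import exp


def energy(labels, scores, ngd, lam=1.0):
    """Energy of one labelling, following the two-term form of Eq. (1).

    labels: tuple of 0/1 per concept (1 = retain, 0 = discard).
    scores: detection scores in [0, 1], one per concept.
    ngd:    dict mapping an index pair (i, j) to a normalized Google distance.
    """
    # Unary term (assumed form): retaining a weakly detected concept or
    # discarding a strongly detected one is costly.
    unary = sum((1.0 - s) if x == 1 else s for x, s in zip(labels, scores))
    # Pairwise term (assumed form): linked concepts with a small NGD are
    # penalized for taking different labels.
    pairwise = sum(
        lam * exp(-d) for (i, j), d in ngd.items() if labels[i] != labels[j]
    )
    return unary + pairwise


def refine(scores, ngd, lam=1.0):
    """Return the minimum-energy labelling by exhaustive search.

    Stand-in for graph cut: enumerates all 2^n labellings.
    """
    candidates = product((0, 1), repeat=len(scores))
    return min(candidates, key=lambda x: energy(x, scores, ngd, lam))


if __name__ == "__main__":
    # Hypothetical scores for three concepts and two links between them.
    scores = [0.9, 0.6, 0.1]
    ngd = {(0, 1): 3.0, (1, 2): 0.0}
    print(refine(scores, ngd, lam=0.0))  # (1, 1, 0): plain thresholding
    print(refine(scores, ngd, lam=1.0))  # (1, 0, 0): concept 1 is discarded
```

With `lam=0.0` the pairwise term vanishes and every concept scoring above 0.5 is retained. With `lam=1.0` the second concept is discarded despite its score of 0.6, because it is tightly linked (NGD of 0) to a concept that is almost certainly absent. This mirrors the behaviour described for Figure 2, where detections above the threshold are overruled by the co-occurrence information.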