       MediaEval 2011 Affect Task: Violent Scene Detection
         combining Audio and Visual Features with SVM

                                 Esra Acar, Stephan Spiegel, Sahin Albayrak
                         DAI Labor, Berlin University of Technology, Berlin, Germany
                          {esra.acar, stephan.spiegel, sahin.albayrak}@dai-labor.de


ABSTRACT
We propose an approach for violence analysis of movies in a multi-modal (visual and audio) manner with one-class and two-class support vector machines (SVM). We use scale-invariant feature transform (SIFT) features with the Bag-of-Words (BoW) approach for the visual content description of movies, while the audio content is described with mel-frequency cepstral coefficient (MFCC) features. We investigate the performance of combining visual and audio features in an early fusion manner to describe violence in movies. The experimental results suggest that one-class SVM is a promising approach for the task.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Algorithms, Performance, Experimentation

Keywords
Violence detection, SVM, SIFT, Bag-of-Words, MFCCs

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy

1.   MOTIVATION AND RELATED WORK
   Although video content analysis has been studied extensively in the literature, violence analysis of movies is restricted to a few studies [3], [4], [5]. Hence, the motivation of the MediaEval 2011 Affect Task is the automatic multi-modal analysis of violent content in movies, so that parents can review the most violent scenes in a movie and prevent their children from watching them. A detailed description of the task, the dataset, the ground truth and the evaluation criteria is given in the paper by Demarty et al. [2].
   Lin and Wang [5] proposed a co-training-based approach in which audio analysis with a modified pLSA algorithm was combined with motion and high-level visual concept analysis. Gong et al. [4] applied a semi-supervised learning approach in which low-level visual and audio features were fused with high-level audio indicators. Giannakopoulos et al. [3] proposed a multi-modal probabilistic late fusion approach. For the MediaEval 2011 Affect Task, we apply multi-modal (audio and visual) analysis as in these studies [3], [4], [5], but in an early fusion manner with one-class SVM [7] and two-class SVM [1]. We report and discuss our results on three Hollywood movies from the MediaEval 2011 dataset [2].

2.   PROPOSED APPROACH
   We propose an approach that merges visual and audio features in a supervised manner (with one-class and two-class SVM) for violence detection in movies. The main idea behind one-class SVM is to construct a hyper-sphere that contains most of the positive training examples and separates them from everything else. The hyper-sphere is determined by two parameters: ν (an upper bound on the fraction of outliers) and σ (the kernel width). Two-class SVM, on the other hand, constructs a hyperplane in the feature space to achieve a good separation between positive and negative examples (i.e. maximum distance between the hyperplane and the nearest training examples of either class).
   For video content description, low-level visual and audio features are extracted from the video shots of the movies. The low-level features are then combined in an early fusion manner to train the SVMs. The multi-modal fusion scheme of our approach is given in Figure 1.

Figure 1: Multi-modal Fusion Scheme

2.1   Audio Features
   To describe the audio content of the movies, we use MFCCs, which are commonly used in audio recognition [6]. Due to the variability in the duration of the annotated video shots, each video shot has a different number of MFCC vectors. Since we want to describe each video shot by one significant feature vector, we compute the mean and standard deviation of each dimension of the MFCC feature vectors to describe the audio signal.
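The shot-level descriptors and their early fusion (Figure 1) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: random arrays stand in for the real per-shot MFCC frames (extracted with the Auditory Toolbox), SIFT keypoints (Lowe's software) and the k-means vocabulary, and the L1 normalisation of the BoW histogram is our illustrative choice, not specified in the paper. The fused vector is what would be fed to the one-class or two-class SVM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-shot inputs with the dimensions used in the paper:
mfccs = rng.normal(size=(240, 13))   # 240 frames x 13-dim MFCCs
sift = rng.normal(size=(85, 128))    # 85 keypoints x 128-dim SIFT
vocab = rng.normal(size=(350, 128))  # visual vocabulary, k = 350 words

def audio_descriptor(mfccs):
    """Mean and std of each MFCC dimension -> one 26-dim vector per shot."""
    return np.concatenate([mfccs.mean(axis=0), mfccs.std(axis=0)])

def bow_histogram(sift, vocab):
    """Assign each SIFT vector to its nearest visual word (Euclidean
    distance) and count word occurrences over the shot's keyframe."""
    dists = ((sift[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # normalise by the number of keypoints

def early_fusion(mfccs, sift, vocab):
    """Concatenate audio and visual descriptors into one feature vector."""
    return np.concatenate([audio_descriptor(mfccs), bow_histogram(sift, vocab)])

x = early_fusion(mfccs, sift, vocab)
print(x.shape)  # (376,) = 26 audio dims + 350 visual words
```

In this setup the SVM never sees the modalities separately; a single 376-dimensional vector per shot carries both cues, which is what distinguishes early fusion from the late (decision-level) fusion of [3].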
2.2   Visual Features
   A SIFT-based BoW approach is used for visual content description. As in BoW approaches known e.g. from [8], a visual vocabulary is constructed by clustering SIFT local feature vectors with the k-means clustering algorithm. Each resulting cluster is treated as a visual word. Once a visual vocabulary of size k (k = 350 in this work) is built, each SIFT feature is assigned to the closest visual word (using Euclidean distance), a histogram is computed for the keyframe of each video shot, and the video shot is represented as a BoW histogram of the visual word occurrences in its keyframe.

2.3   Results and Evaluation
   The aim of this work is to assess the performance of one-class SVM and two-class SVM for violence detection. We evaluated our approach on three Hollywood movies from the MediaEval 2011 dataset [2]. We submitted three runs in total for the MediaEval 2011 Affect Task: svm1(cf1:10), svm1(cf1:1) and svm2(cf1:10). One-class SVM with an RBF kernel was applied in the svm1(cf1:10) and svm1(cf1:1) submissions, whereas two-class SVM with an RBF kernel was applied in the svm2(cf1:10) submission. The cost function mentioned in [2] was used during SVM parameter selection for svm1(cf1:10) and svm2(cf1:10), whereas for the svm1(cf1:1) submission the cost function was adapted (i.e. Cfa = 1 and Cmiss = 1). Parameter optimization was performed by randomly separating the available training data into training, validation and test sets, and choosing the parameter values that gave the minimum cost according to the mentioned cost function. The optimized SVM parameters were ν and σ for one-class SVM, and C and σ for two-class SVM. LibSvm1 was used as the SVM implementation. We employed the Auditory Toolbox2 and David Lowe's SIFT demo software3 to extract the 13-dimensional MFCCs and the 128-dimensional SIFT features, respectively. Table 1 reports the number of false alarms (out of 3871 non-violent shots) and missed detections (out of 629 violent shots), and Table 2 gives the evaluation results. AED-P, AED-R and AED-F correspond to AED [2] precision, recall and F-measure, respectively.

1 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
2 http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010
3 http://www.cs.ubc.ca/~lowe/keypoints/

            Table 1: Misclassified video shots
             Run           Miss          False alarm
        svm1(cf1:10)    18 (2.86%)      3776 (97.55%)
        svm1(cf1:1)    213 (33.86%)     2781 (71.84%)
        svm2(cf1:10)   363 (57.71%)     1350 (35.71%)

     Table 2: Evaluation results for the submitted runs
             Run        AED-P    AED-R    AED-F    AED Cost
        svm1(cf1:10)    0.1393   0.9714   0.2436    1.262
        svm1(cf1:1)     0.1301   0.6614   0.2175    4.105
        svm2(cf1:10)    0.1646   0.4229   0.237     6.12

   The minimum miss rate is achieved with svm1(cf1:10), whereas svm2(cf1:10) has the minimum false alarm rate. However, svm2(cf1:10) has the poorest cost value due to its miss rate. On the other hand, svm1(cf1:10) and svm1(cf1:1) have smaller miss rates, although their false alarm rates are higher compared to svm2(cf1:10). The best cost is achieved with svm1(cf1:10). However, this classifier tends to classify almost every shot as violent, because the cost of a miss is ten times higher than the cost of a false alarm. In svm1(cf1:1), the number of video shots classified as violent is lower, since the costs of a false alarm and a miss are equal. When all three runs are considered, two-class SVM achieves the poorest performance according to the cost measure.
   We observed that one-class SVM tends to classify most of the video shots in the movies as violent even when equal costs are used for false alarms and misses. This may happen for two reasons: (1) the low-level audio and visual features are not selective enough to describe violence, or (2) sub-optimal parameters are used for SVM model construction.

3.   CONCLUSIONS
   We applied one-class and two-class SVM approaches for violence detection in movies. Our main finding is that one-class SVM seems to be a promising approach for the task. The SVM parameters (ν, σ) and the low-level audio and visual features used for the task need to be analyzed in more detail for better results in terms of the false alarm rate. Future work will involve enhancing the SVM parameter selection process and a more detailed analysis of the audio and visual features used for content description in order to reduce the false alarm rate.
   Acknowledgments  We wish to thank Brijnesh Johannes Jain for his comments and suggestions that greatly contributed to the successful completion of this work.

4.   REFERENCES
[1] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[2] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[3] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, and S. Theodoridis. Audio-visual fusion for detecting violent scenes in videos. In Artificial Intelligence: Theories, Models and Applications, vol. 6040 of Lecture Notes in Computer Science, pages 91–100, 2010.
[4] Y. Gong, W. Wang, S. Jiang, Q. Huang, and W. Gao. Detecting violent scenes in movies by auditory and visual cues. In Advances in Multimedia Information Processing - PCM 2008, vol. 5353 of Lecture Notes in Computer Science, pages 317–326, 2008.
[5] J. Lin and W. Wang. Weakly-Supervised Violence Detection in Movies with Audio and Video Based Co-training. In Advances in Multimedia Information Processing - PCM 2009, vol. 5879 of Lecture Notes in Computer Science, pages 930–935, 2009.
[6] B. Logan. Mel frequency cepstral coefficients for music modeling. In Int. Symposium on Music Information Retrieval, 2000.
[7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[8] T. Zhang, C. Xu, G. Zhu, S. Liu, and H. Lu. A Generic Framework for Event Detection in Various Video Domains. In ACM MM, Firenze, Italy, October 25-29, 2010.