=Paper=
{{Paper
|id=None
|storemode=property
|title=MediaEval 2011 Affect Task: Violent Scene Detection combining audio and visual Features with SVM
|pdfUrl=https://ceur-ws.org/Vol-807/acar_TUB_Violence_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AcarSA11
}}
==MediaEval 2011 Affect Task: Violent Scene Detection combining Audio and Visual Features with SVM==
Esra Acar, Stephan Spiegel, Sahin Albayrak<br>
DAI Labor, Berlin University of Technology, Berlin, Germany<br>
{esra.acar, stephan.spiegel, sahin.albayrak}@dai-labor.de

''Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.''

===Abstract===
We propose an approach for violence analysis of movies in a multi-modal (visual and audio) manner with one-class and two-class support vector machines (SVM). We use scale-invariant feature transform (SIFT) features with the Bag-of-Words (BoW) approach for visual content description of movies, while audio content description is performed with mel-frequency cepstral coefficient (MFCC) features. We investigate the performance of combining visual and audio features in an early fusion manner to describe violence in movies. The experimental results suggest that one-class SVM is a promising approach for the task.

'''Categories and Subject Descriptors:''' H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

'''General Terms:''' Algorithms, Performance, Experimentation

'''Keywords:''' Violence detection, SVM, SIFT, Bag-of-Words, MFCCs

===1. Motivation and Related Work===
Although video content analysis has been studied extensively in the literature, violence analysis of movies is restricted to a few studies [3], [4], [5]. The motivation of the MediaEval 2011 Affect Task is therefore the automatic multi-modal analysis of violent content in movies, to help parents review the most violent scenes in a movie and prevent their children from watching them. A detailed description of the task, the dataset, the ground truth and the evaluation criteria is given in the paper by Demarty et al. [2].

Lin et al. [5] proposed a co-training-based approach in which audio analysis with a modified pLSA algorithm was combined with motion and high-level visual concept analysis. Gong et al. [4] applied a semi-supervised learning approach in which low-level visual and audio features were fused with high-level audio indicators. Giannakopoulos et al. [3] proposed a multi-modal probabilistic late fusion approach. For the MediaEval 2011 Affect Task, we apply multi-modal (audio and visual) analysis as in these studies [3], [4], [5], but in an early fusion manner, using one-class SVM [7] and two-class SVM [1]. We report and discuss our results on 3 Hollywood movies from the MediaEval 2011 dataset [2].

===2. Proposed Approach===
We propose an approach that merges visual and audio features in a supervised manner (with one-class and two-class SVM) for violence detection in movies. The main idea behind one-class SVM is to construct a hyper-sphere that contains most of the positive training examples and separates them from the rest of the feature space. The hyper-sphere is determined by two parameters: ν (an upper bound on the fraction of outliers) and σ (the kernel width). Two-class SVM, on the other hand, constructs a hyperplane in the feature space to achieve a good separation between positive and negative examples, i.e. a maximum distance between the hyperplane and the nearest training examples of either class.

For video content description, low-level visual and audio features are extracted from the video shots of the movies. The low-level features are then combined in an early fusion manner to train the SVMs. The multi-modal fusion scheme of our approach is given in Figure 1.

''Figure 1: Multi-modal fusion scheme.''

====2.1 Audio Features====
To describe the audio content of the movies, we use MFCCs, which are commonly used in audio recognition [6]. Due to the variability in the duration of the annotated video shots, each video shot yields a different number of MFCC vectors. Since we want to describe each video shot by a single significant feature vector, we compute the mean and standard deviation of each dimension of the MFCC feature vectors to describe the audio signal.

====2.2 Visual Features====
A SIFT-based BoW approach is used for visual content description. As in BoW approaches known e.g. from [8], a visual vocabulary is constructed by clustering SIFT local feature vectors with the k-means clustering algorithm. Each resulting cluster is treated as a visual word. Once a visual vocabulary of size k (k = 350 in this work) is built, each SIFT feature is assigned to the closest visual word (using Euclidean distance), a histogram is computed for the keyframe of a video shot, and the video shot is represented as a BoW histogram of the visual word occurrences in its keyframe.

====2.3 Results and Evaluation====
The aim of this work is to assess the performance of one-class SVM and two-class SVM for violence detection. We evaluated our approach on 3 Hollywood movies from the MediaEval 2011 dataset [2]. We submitted three runs in total for the MediaEval 2011 Affect Task: svm1(cf1:10), svm1(cf1:1) and svm2(cf1:10). We applied one-class SVM with an RBF kernel in the svm1(cf1:10) and svm1(cf1:1) submissions, whereas two-class SVM with an RBF kernel was applied in the svm2(cf1:10) submission. The cost function mentioned in [2] was used during SVM parameter selection for svm1(cf1:10) and svm2(cf1:10), whereas for the svm1(cf1:1) submission the cost function was adapted (i.e. C<sub>fa</sub> = 1 and C<sub>miss</sub> = 1). Parameter optimization was performed by randomly separating the available training data into training, validation and test sets and choosing the parameter values that gave the minimum cost according to the mentioned cost function. The optimized SVM parameters were ν and σ for one-class SVM, and c and σ for two-class SVM. LibSvm<sup>1</sup> was used as the SVM implementation. We employed the Auditory Toolbox<sup>2</sup> and David Lowe's SIFT demo software<sup>3</sup> to extract the 13-dimensional MFCC and 128-dimensional SIFT features, respectively. Table 1 reports the number of false alarms (out of 3871) and miss detections (out of 629), and Table 2 gives the evaluation results. AED-P, AED-R and AED-F correspond to AED [2] precision, AED recall and AED F-measure, respectively.

'''Table 1: Misclassified video shots'''
{| class="wikitable"
! Run !! Miss !! False alarm
|-
| svm1(cf1:10) || 18 (2.86 %) || 3776 (97.55 %)
|-
| svm1(cf1:1) || 213 (33.86 %) || 2781 (71.84 %)
|-
| svm2(cf1:10) || 363 (57.71 %) || 1350 (35.71 %)
|}

'''Table 2: Evaluation results for the submitted runs'''
{| class="wikitable"
! Run !! AED-P !! AED-R !! AED-F !! AED Cost
|-
| svm1(cf1:10) || 0.1393 || 0.9714 || 0.2436 || 1.262
|-
| svm1(cf1:1) || 0.1301 || 0.6614 || 0.2175 || 4.105
|-
| svm2(cf1:10) || 0.1646 || 0.4229 || 0.237 || 6.12
|}

The minimum miss rate is achieved with svm1(cf1:10), whereas svm2(cf1:10) has the minimum false alarm rate. However, svm2(cf1:10) has the poorest cost value due to its miss rate. svm1(cf1:10) and svm1(cf1:1), on the other hand, have smaller miss rates, while their false alarm rates are higher compared to svm2(cf1:10). The best cost is achieved with svm1(cf1:10). However, the SVM classifier tends to classify almost every shot as violent in svm1(cf1:10), because the cost of a miss is ten times higher than the cost of a false alarm. In svm1(cf1:1), the number of video shots classified as violent is lower, since the costs of a false alarm and a miss are equal. Considering all three runs, two-class SVM achieves the poorest performance according to the cost measure.

We observed that one-class SVM tends to classify most of the video shots in the movies as violent even when equal costs are used for false alarms and misses. This may happen for two reasons: (1) the low-level audio and visual features are not selective enough to describe violence, or (2) sub-optimal parameters are used for SVM model construction.

===3. Conclusions===
We applied one-class and two-class SVM approaches for violence detection in movies. Our main finding is that one-class SVM seems a promising approach for the task. The SVM parameters (ν, σ) and the low-level audio and visual features used for the task need to be analyzed in more detail for better results in terms of the false alarm rate. Future work will involve enhancing the optimal SVM parameter selection process and a more detailed analysis of the audio and visual features used for content description to reduce the false alarm rate.

'''Acknowledgments:''' We wish to thank Brijnesh Johannes Jain for his comments and suggestions that greatly contributed to the successful completion of this work.

===4. References===
* [1] C. Cortes and V. Vapnik. Support-vector networks. In Machine Learning, pages 273–297, 1995.
* [2] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2011 Affect Task: Violent Scenes Detection in Hollywood Movies. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
* [3] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, and S. Theodoridis. Audio-visual fusion for detecting violent scenes in videos. In Artificial Intelligence: Theories, Models and Applications, vol. 6040 of Lecture Notes in Computer Science, pages 91–100, 2010.
* [4] Y. Gong, W. Wang, S. Jiang, Q. Huang, and W. Gao. Detecting violent scenes in movies by auditory and visual cues. In Advances in Multimedia Information Processing - PCM 2008, vol. 5353 of Lecture Notes in Computer Science, pages 317–326, 2008.
* [5] J. Lin and W. Wang. Weakly-Supervised Violence Detection in Movies with Audio and Video Based Co-training. In Advances in Multimedia Information Processing - PCM 2009, vol. 5879 of Lecture Notes in Computer Science, pages 930–935, 2009.
* [6] B. Logan. Mel frequency cepstral coefficients for music modeling. In Int. Symposium on Music Information Retrieval, 2000.
* [7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
* [8] T. Zhang, C. Xu, G. Zhu, S. Liu, and H. Lu. A Generic Framework for Event Detection in Various Video Domains. In ACM MM, Firenze, Italy, October 25-29, 2010.

<sup>1</sup> http://www.csie.ntu.edu.tw/~cjlin/libsvm/<br>
<sup>2</sup> http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010<br>
<sup>3</sup> http://www.cs.ubc.ca/~lowe/keypoints/
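The fixed-length audio descriptor of Section 2.1 (per-dimension mean and standard deviation of a shot's MFCC frames) can be sketched as follows. This is an illustrative sketch, not the authors' code: the paper extracted 13-dimensional MFCCs with the Auditory Toolbox, so here a precomputed MFCC matrix is simply assumed as input.

```python
import numpy as np

def shot_audio_descriptor(mfccs: np.ndarray) -> np.ndarray:
    """Collapse a variable-length sequence of MFCC frames into one
    fixed-length vector: the mean and standard deviation of each of
    the 13 MFCC dimensions, concatenated into a 26-dim descriptor."""
    return np.concatenate([mfccs.mean(axis=0), mfccs.std(axis=0)])

# Shots of different durations map to descriptors of equal length,
# which is the point of the mean/std summarisation.
short_shot = np.random.rand(40, 13)    # stand-in for Auditory Toolbox output
long_shot = np.random.rand(400, 13)
assert shot_audio_descriptor(short_shot).shape == (26,)
assert shot_audio_descriptor(long_shot).shape == (26,)
```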
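The BoW step of Section 2.2 (assign each SIFT descriptor of a keyframe to its closest visual word by Euclidean distance, then count occurrences) can be sketched with NumPy. The vocabulary is assumed to come from k-means clustering as in the paper (k = 350 there; a toy k is used here), and the random arrays stand in for real SIFT output.

```python
import numpy as np

def bow_histogram(descriptors: np.ndarray, vocabulary: np.ndarray) -> np.ndarray:
    """Represent a keyframe as a histogram of visual-word occurrences.

    descriptors: (n, 128) SIFT descriptors of the shot's keyframe.
    vocabulary:  (k, 128) cluster centres obtained with k-means.
    """
    # Pairwise squared Euclidean distances, shape (n, k).
    dists = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)  # index of the nearest visual word per descriptor
    return np.bincount(words, minlength=len(vocabulary))

rng = np.random.default_rng(0)
vocab = rng.random((8, 128))           # toy vocabulary (k = 8 for illustration)
keyframe_sift = rng.random((50, 128))  # stand-in for SIFT demo software output
hist = bow_histogram(keyframe_sift, vocab)
assert hist.shape == (8,) and hist.sum() == 50
```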
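The early-fusion one-class setup of Section 2 can be sketched with scikit-learn, whose `OneClassSVM` wraps the same LibSVM implementation the paper used: `nu` corresponds to the paper's ν (upper bound on the outlier fraction) and `gamma` plays the role of the RBF kernel width σ. The feature arrays below are random placeholders, not MediaEval data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
n_shots = 100
audio = rng.random((n_shots, 26))    # per-shot MFCC mean/std descriptors
visual = rng.random((n_shots, 350))  # per-shot BoW histograms (k = 350)

# Early fusion: concatenate both modalities into one vector per shot
# before training, rather than fusing classifier outputs afterwards.
fused = np.hstack([audio, visual])

# One-class SVM with an RBF kernel, trained on (notionally violent) shots.
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(fused)

# +1 = inside the hyper-sphere (violent), -1 = outlier (non-violent).
pred = model.predict(fused)
assert set(pred.tolist()) <= {-1, 1}
```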
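As a worked check, the AED Cost column of Table 2 is consistent with a weighted detection cost of the form C = C<sub>miss</sub> · (miss rate) + C<sub>fa</sub> · (false-alarm rate) with C<sub>miss</sub> = 10 and C<sub>fa</sub> = 1. These weights are inferred from the numbers, not stated here; the official function is defined in Demarty et al. [2].

```python
def aed_cost(misses, false_alarms, n_violent=629, n_nonviolent=3871,
             c_miss=10.0, c_fa=1.0):
    """Weighted detection cost: a miss weighted 10x a false alarm
    (assumed weights; see [2] for the official definition)."""
    return c_miss * misses / n_violent + c_fa * false_alarms / n_nonviolent

# Reproduces the AED Cost column of Table 2 from the counts in Table 1.
print(round(aed_cost(18, 3776), 3))   # svm1(cf1:10) -> 1.262
print(round(aed_cost(213, 2781), 3))  # svm1(cf1:1)  -> 4.105
print(round(aed_cost(363, 1350), 3))  # svm2(cf1:10) -> 6.12
```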