=Paper=
{{Paper
|id=Vol-1436/Paper37
|storemode=property
|title=RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper37.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MironicaISSS15
}}
==RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach==
* Ionuţ Mironică, University Politehnica of Bucharest, Romania (imironica@imag.pub.ro)
* Bogdan Ionescu, University Politehnica of Bucharest, Romania (bionescu@imag.pub.ro)
* Mats Sjöberg, Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland (mats.sjoberg@helsinki.fi)
* Markus Schedl, Johannes Kepler University, Linz, Austria (markus.schedl@jku.at)
* Marcin Skowron, Austrian Research Institute for Artificial Intelligence, Vienna, Austria (marcin.skowron@ofai.at)

===Abstract===
The MediaEval 2015 Affective Impact of Movies Task challenged participants to automatically find violent scenes in a set of videos and to predict the affective impact that video content will have on viewers. We propose the use of several multimodal descriptors, namely visual, motion and auditory features, and fuse their predictions to detect violent or affective content. With regard to the official metrics, our best-performing runs achieved a MAP of 0.1419 in the violence detection task, and accuracies of 45.038% for arousal estimation and 36.123% for valence estimation.

===1. Introduction===
The MediaEval 2015 Affective Impact of Movies Task [6] challenged participants to develop algorithms for finding violent scenes in movies. In contrast to previous years, the organizers also introduced a completely new subtask for detecting the emotional impact of movies. The task provided a dataset of 10,900 short video clips extracted from 199 Creative Commons-licensed movies. A detailed description of the task, the dataset, the ground truth and the evaluation criteria is given in the paper by Sjöberg et al. [6].

Our system this year is largely based on several multimodal systems that have already obtained good results on similar problems [3, 4, 5].

===2. Method===
Our system builds on a set of visual, motion and auditory features, combined with a Support Vector Machine (SVM) classifier to obtain a violence or affect score for each video document. First, we perform feature extraction at the frame level. The resulting features are aggregated into one video descriptor using different strategies: the average of features, the Fisher kernel (FK) [4], or the Vector of Locally Aggregated Descriptors (VLAD) [3]. Finally, the global video descriptors are fed into an SVM multi-classifier framework. These steps are detailed in the following.

====2.1 Feature set====
'''Visual:''' We extracted ColorSIFT features [8] using the opponent colour space and spatial pyramids with two different sampling strategies: the Harris-Laplace salient point detector and dense sampling. We employed the Bag-of-Visual-Words (BoVW) approach, where each spatial pyramid partition is represented by a 1,000-dimensional histogram over its ColorSIFT features. We also computed the CENsus TRansform hISTogram (CENTRIST) descriptor proposed in [9]. In addition, we used a total of four Convolutional Neural Network (CNN) features, following the protocol laid out in [1]. The CNNs were trained on either the ImageNet 2010 or 2012 training datasets, following as closely as possible the network structure and parameters of Krizhevsky et al. [2]. The input images were resized to 256×256 pixels either by distortion or by center cropping, giving in total four different CNNs from which we extract four different sets of feature vectors. We use the activations of the first fully-connected layer of each network as our features, which results in 4096-dimensional feature vectors. Ten regions were extracted from the test images as suggested in [2] (four corners and center patch, plus flipping), and a component-wise maximum is then taken over the region-wise features.
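The ten-region pooling step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the <code>cnn_fc1</code> stub is a hypothetical stand-in for the first fully-connected layer of the ImageNet-trained CNN, and the 224×224 crop size and helper names are assumptions made for illustration only.

<syntaxhighlight lang="python">
import numpy as np

def ten_crops(img, crop=224):
    """Four corner crops and a centre crop, each also horizontally flipped (10 regions)."""
    h, w, _ = img.shape
    tops  = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    crops = [img[t:t + crop, l:l + crop] for t, l in zip(tops, lefts)]
    return crops + [c[:, ::-1] for c in crops]        # add horizontal flips

def cnn_fc1(patch):
    """Hypothetical stub for the first fully-connected layer of the CNN; in the
    paper this would be a 4096-dimensional activation from an AlexNet-style
    network trained on ImageNet. Here it is just a deterministic dummy vector."""
    rng = np.random.default_rng(int(patch.sum()))
    return rng.standard_normal(4096)

def frame_descriptor(img):
    """Component-wise maximum over the ten region-wise CNN feature vectors."""
    return np.stack([cnn_fc1(c) for c in ten_crops(img)]).max(axis=0)

frame = np.zeros((256, 256, 3), dtype=np.uint8)       # a resized 256x256 input frame
print(frame_descriptor(frame).shape)                  # (4096,)
</syntaxhighlight>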
'''Auditory:''' For the audio features, we used descriptors provided within the block-level framework [5], which have proven useful for retrieval, classification and similarity tasks in the audio and music domain. More precisely, for the audio channel of each video we computed its spectral pattern (considers the cent-scaled spectrum on a 10-frame basis to characterize frequency and timbre), delta spectral pattern (computes the difference between the original spectrum and a copy of the spectrum delayed by 3 frames), variance delta spectral pattern (considers the variance between the delta spectral blocks), logarithmic fluctuation pattern (applies several psychoacoustic models and characterizes the amplitude modulations), correlation pattern (computes Pearson's correlation between all pairs of 52 cent-scaled frequency bands), and spectral contrast pattern (computes the difference between spectral peaks and valleys in 20 cent-scaled frequency bands). Each clip is eventually characterized by a 9,448-dimensional feature vector that models its audio content.

'''Motion:''' We computed the Histogram of Oriented Gradients (3D-HoG) and Histogram of Optical Flow (3D-HoF) cuboid motion features [7]. Each feature is computed in 3D blocks with a dense sampling strategy: first, the gradient magnitude responses in the horizontal and vertical directions are computed. Then, for each response, the magnitude is quantized into k orientations, where k = 8. Finally, these responses are aggregated over blocks of pixels in both the spatial and temporal directions and concatenated.

====2.2 Frame aggregation====
Results from the literature show that adopting Fisher kernel [4] and VLAD [3] representations in many video classification tasks achieves higher accuracy than traditional Bag-of-Words histogram representations, because these representations capture the temporal variation over the frames within a video. We used two classical methods to encode the temporal variation of the frame-based features, the Fisher kernel [4] and a modified version of the Vector of Locally Aggregated Descriptors [3], and applied them to the frame features presented in Section 2.1.
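For reference, the sketch below shows a standard VLAD encoding of frame-level features, using NumPy and a scikit-learn KMeans codebook. It illustrates the general aggregation technique only; the modified VLAD of [3] and the Fisher kernel used in the paper differ in their details, and the dimensionalities and names here are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(frame_feats, codebook):
    """Standard VLAD: sum of residuals between frame features and their nearest
    codeword, followed by power and L2 normalization."""
    k, d = codebook.cluster_centers_.shape
    assign = codebook.predict(frame_feats)            # nearest codeword per frame
    vlad = np.zeros((k, d))
    for c in range(k):
        members = frame_feats[assign == c]
        if len(members):
            vlad[c] = (members - codebook.cluster_centers_[c]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))      # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)      # L2 normalization

# toy example: 120 frames with 64-dimensional frame descriptors, 16 codewords
rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 64))
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(frames)
video_descriptor = vlad_encode(frames, codebook)
print(video_descriptor.shape)                         # (1024,)
</syntaxhighlight>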
====2.3 Classifier====
The final component of the system is the classifier, which is fed with the multimodal descriptors computed in the previous steps. Among the broad range of existing classification approaches, we selected an SVM classifier. We tested several types of kernels, i.e., a fast linear kernel and two nonlinear kernels: RBF and Chi-Square. While linear SVMs are very fast in both training and testing, SVMs with nonlinear kernels are more accurate in many classification tasks due to their better adaptation to the shape of the clusters in the feature space.

Finally, in the case of multimodal features, we combine the SVM output confidence values using a max late-fusion rule:

<math>\mathrm{CombMean}(d, q) = \max_{i=1}^{N} cv_i \qquad (1)</math>

where cv_i is the confidence value of classifier i for class q (q ∈ {1, ..., C}), C represents the number of classes, d is the current video, and N is the number of classifiers to be aggregated.
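A minimal sketch of the max late fusion in Eq. (1): the per-class confidence values of the N single-modality classifiers are stacked and combined by a component-wise maximum, and the predicted class is the arg max. Array shapes and names are ours, for illustration.

<syntaxhighlight lang="python">
import numpy as np

def max_late_fusion(confidences):
    """Eq. (1): for each class q, keep the maximum confidence value cv_i over
    the N classifiers; `confidences` has shape (N, C)."""
    return np.max(confidences, axis=0)

# toy example: N = 3 classifiers (audio, visual, motion), C = 2 classes
conf = np.array([[0.20, 0.80],    # audio SVM
                 [0.55, 0.45],    # visual SVM
                 [0.35, 0.65]])   # motion SVM
fused = max_late_fusion(conf)
print(fused, fused.argmax())      # [0.55 0.8] 1
</syntaxhighlight>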
===3. Experimental results===

====3.1 Submitted runs====
We submitted five runs for both tasks, the violence detection task and the induced affect detection task. For the first run, we combined the audio features with a nonlinear SVM classifier. For the second run, we combined several visual features (BoVW-ColorSIFT, CENTRIST histograms and CNN features) with a nonlinear SVM classifier. The third run uses the modified VLAD with the 3D-HoG/3D-HoF motion features and nonlinear SVM classifiers. In the fourth run, we aggregate the CNN frame features with the Fisher kernel representation and use a linear SVM classifier. Finally, for the fifth run we performed a late fusion of the first four runs.

{| class="wikitable"
|+ Table 1: Results for the submitted runs.
|-
! rowspan="2" | Run
! rowspan="2" | Description
! Violence task
! colspan="2" | Affect task
|-
! MAP
! Accuracy (valence)
! Accuracy (arousal)
|-
| run 1 || Average on audio descriptors & nonlinear SVM || 0.0485 || 33.032% || 45.038%
|-
| run 2 || Average on visual features & nonlinear SVM || 0.0452 || 36.123% || 34.104%
|-
| run 3 || Modified VLAD with motion features & linear SVM || 0.0768 || 29.731% || 39.865%
|-
| run 4 || Fisher kernel with CNN visual features [2] & linear SVM || 0.1419 || 30.320% || 44.365%
|-
| run 5 || Late fusion between all the previous runs || 0.0824 || 29.752% || 37.595%
|}

====3.2 Results and discussion====
Table 1 details the results for all our runs. The third column presents the MAP obtained on the violence task, while the next two columns give the final accuracy on the second task: the valence and arousal predictions. Audio features and standard visual features performed poorly in the violence task, whereas the combination of VLAD with motion features obtained better results. The best results were obtained using the Fisher kernel with CNN visual features; fusing all the features together did not improve on the FK-CNN result. In contrast, in the induced affect detection task all combinations performed similarly, except for the audio features, which achieved a clearly better result.

===4. Conclusions===
In this paper, we presented several multimodal approaches for the detection of violent content in movies. We obtained the best results on the violence task by using motion and visual features, whereas on the affect task we obtained the best results using the audio features only. The visual and motion features gave lower results for both valence and arousal prediction. One reason for this is that the visual features are not well suited to the purpose of the affect task; it also indicates that the affect task is more challenging than the violence task.

===Acknowledgements===
We received support from the Austrian Science Fund (FWF): P25655 and the InnoRESEARCH POSDRU/159/1.5/S/132395 program.

===5. References===
* [1] M. Koskela and J. Laaksonen. Convolutional network features for scene recognition. In Proceedings of the 22nd International Conference on Multimedia, Orlando, Florida, November 2014.
* [2] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NIPS), 2012.
* [3] I. Mironică, I. Duţă, B. Ionescu, and N. Sebe. A modified vector of locally aggregated descriptors approach for fast video classification. Multimedia Tools and Applications (MTAP), 2015.
* [4] I. Mironică, J. Uijlings, N. Rostamzadeh, B. Ionescu, and N. Sebe. Time matters! Capturing variation in time in video using Fisher kernels. In ACM Multimedia, Barcelona, Spain, 21-25 October 2013.
* [5] K. Seyerlehner, G. Widmer, M. Schedl, and P. Knees. Automatic music tag classification based on block-level features. In Proceedings of the 7th Sound and Music Computing Conference (SMC 2010), Barcelona, Spain, July 2010.
* [6] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.
* [7] J. Uijlings, I. Duta, E. Sangineto, and N. Sebe. Video classification with densely extracted HOG/HOF/MBH features: an evaluation of the accuracy/computational efficiency trade-off. International Journal of Multimedia Information Retrieval, pages 1–12, 2014.
* [8] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 32(9):1582–1596, 2010.
* [9] J. Wu and J. M. Rehg. CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(8):1489–1501, 2011.