=Paper=
{{Paper
|id=Vol-1436/Paper30
|storemode=property
|title=KIT at MediaEval 2015 - Evaluating Visual Cues for Affective Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper30.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/PHTS15
}}
==KIT at MediaEval 2015 - Evaluating Visual Cues for Affective Impact of Movies Task==
KIT at MediaEval 2015 – Evaluating Visual Cues for Affective Impact of Movies Task

Marin Vlastelica P.*, Sergey Hayrapetyan*, Makarand Tapaswi*, Rainer Stiefelhagen
Computer Vision for Human Computer Interaction, Karlsruhe Institute of Technology, Germany
marin.vlastelicap@gmail.com, s.hayrapetyan@hotmail.com, tapaswi@kit.edu

* indicates equal contribution
Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
We present the approach and results of our system on the MediaEval Affective Impact of Movies Task. The challenge involves two primary tasks: affect classification and violence detection. We test the performance of multiple visual features followed by linear SVM classifiers. Inspired by successes in different vision fields, we use (i) GIST features used in scene modeling, (ii) features extracted from a deep convolutional neural network trained on object recognition, and (iii) improved dense trajectory features encoded using Fisher vectors, commonly used in action recognition.

1. INTRODUCTION
As the number of videos grows rapidly, automatically analyzing and indexing them is a topic of growing interest. One interesting area is to analyze the affect such videos have on viewers. This can lead to improved recommendation systems (in the case of movies) or help improve overall video search performance. Another task is to predict the amount of violent content in videos, thus supporting automatic filters for sensitive videos based on viewer age. The MediaEval 2015 task – “Affective Impact of Movies” [9] – studies these two areas.

The affect task is posed as a classification problem on a two-dimensional arousal-valence plane, where each dimension is discretized to 3 values (classes). The violence task, on the other hand, is presented as a detection problem. Please refer to [9] for task and dataset details.

2. APPROACH
In this section we describe the features and classifiers we use to analyze the affective impact of movies.

2.1 Development splits
The development set consists of 6144 short video clips obtained from 100 different movies. To analyze the movies we use 5-fold cross-validation on the dataset. The data is split into 5 sets with two goals in mind: (i) the source movies in the training and test splits are different; (ii) the distribution of class labels (positive/neutral/negative) is kept close to that of the original complete set. In this way, we achieve 5 fairly independent splits for training and testing our models. The splits contain a differing number of movies in the training and test sets, ranging from 65/35 to 91/9.
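As a rough illustration (not the authors' released code), movie-disjoint, label-stratified folds of this kind can be generated with scikit-learn [1]. The sketch below assumes a recent scikit-learn version that provides StratifiedGroupKFold; the arrays clip_labels and clip_movie_ids are hypothetical placeholders for the per-clip annotations.

```python
# A minimal sketch, not the authors' code: build 5 folds where (i) no movie
# appears in both train and test and (ii) the class distribution stays close
# to the full development set. Assumes scikit-learn >= 1.0.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

n_clips = 6144
clip_labels = np.random.randint(0, 3, size=n_clips)       # placeholder valence classes (0/1/2)
clip_movie_ids = np.random.randint(0, 100, size=n_clips)  # placeholder source-movie id per clip

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(
        cv.split(np.zeros(n_clips), clip_labels, groups=clip_movie_ids)):
    train_movies = set(clip_movie_ids[train_idx])
    test_movies = set(clip_movie_ids[test_idx])
    assert not train_movies & test_movies  # source movies are disjoint across the split
    print("fold %d: %d/%d movies (train/test)" % (fold, len(train_movies), len(test_movies)))
```

With the real clip labels and movie identifiers, each fold then plays the role of one of the training/test splits described above.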
2.2 Descriptors and models
We focus primarily on simple visual cues to estimate the affect of videos and detect violence in them. To this end, we use three feature types, each followed by linear SVM classifiers.

For the image-based descriptors, we extract exemplar images from the video, sampled every 10 frames. To compensate for shot changes within the video clips, we do not average the features across the video but use them directly to train our models. The video-level label is assumed to be shared across all images of the clip.

GIST. We use GIST features that were developed in the context of scene recognition [7]. We expect these features to provide good performance on the valence task. The features are extracted on each part of an image broken down using a 4 × 4 grid to yield a 512 dimensional descriptor. We then train multi-class linear SVM classifiers on these features for the affect tasks (valence and arousal) and another linear SVM for the violence detection task.

CNN features. Since the ImageNet winning method proposed by Krizhevsky et al. [6] in 2012, deep convolutional neural networks (CNNs) have revolutionized computer vision. These networks have a large number of parameters and are trained end-to-end (from image to label) using massive datasets. The initial convolutional layers act as low-level feature extractors, while the higher-level fully connected layers start learning about object shapes.

Inspired by DeCAF [2], we use the BVLC Reference CaffeNet model provided with the Caffe framework [4] as a feature extractor. The model contains 5 convolutional layers, 2 fully connected layers and a soft-max classifier. We use the output of the last fully connected layer to obtain 4096 dimensional features for the images from the video clips. Linear SVMs are trained on these features for all tasks. Owing to the complexity of the model and its ability to capture a large number of variations, we expect these features to perform well for all tasks.

Improved Dense Trajectories. Dense trajectories are an effective descriptor for action recognition. Wang and Schmid [10] recently proposed additional steps to obtain Improved Dense Trajectories (IDT). Unlike dense trajectories, these features estimate and correct camera motion and thus obtain trajectories primarily on the foreground moving objects (often human actors). As violence in videos is often characterized by rapid motion, we anticipate these features to work well for violence detection.

Several descriptors are computed for each trajectory – Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), Motion Boundary Histogram (MBH) and overall trajectory characteristics – to obtain a 426 dimensional representation per trajectory. These features are projected via PCA to 213 dimensions and finally encoded using state-of-the-art Fisher vector encoding [8]. This results in a 109,056 dimensional feature representation for the entire video. Finally, as before, we train linear SVMs using these features.

2.3 Software environment
The descriptors and models were developed and trained in Python and Matlab. We used the scikit-learn machine learning framework [1] in Python, which uses the liblinear SVM library [3] as the backend. For extracting features from deep neural networks we used the Caffe framework [4], which provides a simple interface for classification and feature extraction with convolutional neural networks (CNNs). We extract IDT features using the provided code and implement Fisher vector encoding in Matlab.
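As an illustrative sketch of the CNN feature pipeline described above (and not the authors' released code), the snippet below pulls the 4096-dimensional output of CaffeNet's last fully connected layer ("fc7" in the standard deploy definition) with pycaffe [4] and trains a linear SVM via scikit-learn's liblinear backend [1, 3]. The model paths, the frame/label variables and the omission of ImageNet mean subtraction are simplifying assumptions.

```python
# A minimal sketch, assuming the standard BVLC CaffeNet model-zoo layout;
# train_frame_paths and train_frame_labels are hypothetical per-frame lists.
import numpy as np
import caffe
from sklearn.svm import LinearSVC

proto = "models/bvlc_reference_caffenet/deploy.prototxt"
weights = "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel"

net = caffe.Net(proto, weights, caffe.TEST)
net.blobs["data"].reshape(1, 3, 227, 227)            # one image at a time

transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))         # HxWxC -> CxHxW
transformer.set_channel_swap("data", (2, 1, 0))      # RGB -> BGR
transformer.set_raw_scale("data", 255)               # [0,1] -> [0,255]
# (ImageNet mean subtraction omitted here for brevity.)

def fc7_features(image_paths):
    """Return an (n_images, 4096) matrix of fc7 activations."""
    feats = []
    for path in image_paths:
        image = caffe.io.load_image(path)
        net.blobs["data"].data[...] = transformer.preprocess("data", image)
        net.forward()
        feats.append(net.blobs["fc7"].data[0].copy())
    return np.vstack(feats)

# Every sampled frame inherits the label of its clip.
X_train = fc7_features(train_frame_paths)
clf = LinearSVC(C=1.0)                                # liblinear backend [3]
clf.fit(X_train, train_frame_labels)
```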
3. EVALUATION
We now present and discuss the results obtained with the visual features. The best classifier parameters were obtained by cross-validation on the development set.

The affect task, which includes valence and arousal, is treated as a multi-class classification problem (three classes each). The metric for these is the overall class prediction accuracy (acc). The violence task is a detection problem and uses average precision (ap) to evaluate the different methods.

We present the results of the various run submissions in Table 1. The run submissions are as follows:
• Run 1: GIST features + linear SVMs
• Run 2: IDT features + linear SVMs
• Run 3: CNN features + linear SVMs
• Run 4: Fusion-1 + linear SVMs
• Run 5: Fusion-2 + linear SVMs
Note that Runs 3 and 5 constitute external runs (Ext) since they use pre-trained CNN models. All other submissions are trained solely on the development data.

Table 1: Evaluation on the test set. The first three runs use a single feature, while the latter two use late fusion.
        Ext.   Valence (acc)   Arousal (acc)   Violence (ap)
Run 1    -         35.5            30.8             7.1
Run 2    -         36.0            46.7             8.6
Run 3    X         38.5            44.7            10.2
Run 4    -         35.7            46.7            10.7
Run 5    X         38.5            51.9            12.9

We see that CNN features (Run 3) outperform the first two single-feature runs on valence and violence. Contrary to expectations, IDT features (Run 2) perform best on arousal classification. This can be explained by the fact that passive videos often contain very little motion, while active videos contain much more.

While we expected IDT features to perform well on violence, videos annotated as violent need not contain active motion and violence, and can often be shots of a post-crime scene. CNN features seem to work better in this case.

Fusion runs. Runs 4 and 5 constitute fusion of different features. Run 4, the Fusion-1 scheme, uses the features provided along with the dataset (IAV: image/audio/video features concatenated and trained as one model), GIST and IDT. Run 5, the Fusion-2 scheme, includes the above along with CNN features (thus making it an external data run).

In order to fuse the different features, we choose the best model for each feature type. We then perform late fusion, where the final score for each video is a weighted combination of the individual feature predictions. We try a grid of discrete weights to generate a large number of combinations and pick the best scoring model based on cross-validation.

Both fusion schemes perform equal to or better than the single features. For the Fusion-1 scheme, we see that IDT features get the highest weight, followed by the IAV (dataset) features. In the case of Fusion-2, CNN and IDT features are weighted higher.
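A minimal sketch of this late-fusion weight search, assuming per-feature decision scores (e.g., LinearSVC decision_function outputs) have already been computed on a held-out cross-validation split; the weight grid and the variable names (gist_scores, cnn_scores, idt_scores, val_labels) are illustrative, not the authors' exact setup.

```python
# A minimal late-fusion sketch, shown for violence detection with average
# precision; gist_scores, cnn_scores, idt_scores and val_labels are
# hypothetical arrays for one cross-validation split.
import itertools
import numpy as np
from sklearn.metrics import average_precision_score

scores = {"gist": gist_scores, "cnn": cnn_scores, "idt": idt_scores}
weight_grid = np.linspace(0.0, 1.0, 11)   # discrete weights 0.0, 0.1, ..., 1.0

best_ap, best_weights = -1.0, None
for w in itertools.product(weight_grid, repeat=len(scores)):
    if sum(w) == 0:
        continue
    fused = sum(wi * s for wi, s in zip(w, scores.values())) / sum(w)
    ap = average_precision_score(val_labels, fused)
    if ap > best_ap:
        best_ap, best_weights = ap, dict(zip(scores, w))

print("best cross-validated ap: %.3f with weights %s" % (best_ap, best_weights))
```

For the affect tasks, the same search can be run with accuracy on the fused class scores in place of average precision.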
Error analysis. We present a short analysis of the errors we encountered on the development set. For violent video detection, some of the difficult samples include black-and-white videos with rapid blinking. In the case of affect analysis, for both valence and arousal classification, cartoon scenes were often deemed colorful and classified as positive (or active) while their ground truth was neutral or negative (or passive).

4. CONCLUSION
We conclude that the CNN features are the best single features for the affective impact of movies task. Fine-tuning the model, or training a model to perform video classification as in [5], could further improve the performance. Fusing the models results in only a slight improvement, indicating that using other modalities such as meta-data and audio might help improve performance.

5. REFERENCES
[1] Python scikit-learn: machine learning framework. http://scikit-learn.org/.
[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In International Conference on Machine Learning (ICML), 2014.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Neural Information Processing Systems (NIPS), 2012.
[7] A. Oliva and A. Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision (IJCV), 42(3):145–175, 2001.
[8] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision (ECCV), 2010.
[9] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, 2015.
[10] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV), 2013.