The ICL-TUM-PASSAU Approach for the MediaEval 2015 "Affective Impact of Movies" Task

George Trigeorgis1, Eduardo Coutinho1, Fabien Ringeval2,3, Erik Marchi2, Stefanos Zafeiriou1, Björn Schuller1,3
1 Department of Computing, Imperial College London, UK
2 Machine Intelligence & Signal Processing Group, Technische Universität München, Munich, Germany
3 Chair of Complex & Intelligent Systems, University of Passau, Germany
g.trigeorgis@imperial.ac.uk

ABSTRACT
In this paper we describe the Imperial College London, Technische Universität München and University of Passau (ICL+TUM+PASSAU) team approach to the MediaEval 2015 "Affective Impact of Movies" challenge, which consists in the automatic detection of affective (arousal and valence) and violent content in movie excerpts. In addition to the baseline features, we computed spectral and energy related acoustic features, as well as the probability of various objects being present in the video. Random Forests, AdaBoost and Support Vector Machines were used as classification methods. The best results show that the dataset is highly challenging for both the affect and violence detection tasks, mainly because of issues in inter-rater agreement and data scarcity.

1. INTRODUCTION
The MediaEval 2015 Challenge "Affective Impact of Movies" comprises two subtasks using the LIRIS-ACCEDE database [2]. Subtask 1 targets the automatic categorisation of videos in terms of their affective impact: the goal is to identify the arousal (calm-neutral-excited) and valence (negative-neutral-positive) level of each video. The goal of Subtask 2 is to identify those videos that contain violent scenes. The full description of the tasks can be found in [22].

2. METHODOLOGY

2.1 Subtask 1: affect classification

Feature sets.
In our work we used both the baseline features provided by the organisers [2] and our own sets of audio-visual features, as described below.

The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) was used to extract acoustic features with the openSMILE toolkit [6]. This feature set was designed as a standard acoustic parameter set for automatic speech emotion recognition [5, 18, 16] and has also been used successfully for other paralinguistic tasks [17]. The eGeMAPS comprises a total of 18 Low-Level Descriptors (LLDs), including frequency, energy/amplitude, and spectral related features. Various functionals were then applied to the LLDs over the whole instance, giving rise to a total of 88 features.

The emotional impact of videos can be heavily influenced by the kind of objects present in a given scene [11, 12, 15]. We therefore computed the probability of 1000 different object classes being present in each frame using a 16-layer convolutional neural network (CNN) pretrained on the ILSVRC2013 dataset [21, 4]. Let x ∈ R^{N×p} represent a video of the database with N frames and p pixels per frame, and f(·) the trained convolutional neural network with softmax activation functions in the output layer. The probability Pr(y = c | x_i; θ) of each of the 1000 classes being present in the i-th frame x_i of a video is obtained by forwarding the p pixel values through the network. By averaging these activations over all N frames of a video sequence, we obtained the probability distribution of the 1000 ILSVRC2013 classes that might be present in the video.
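The clip-level video descriptor is thus a simple average of per-frame class probabilities. The sketch below illustrates this averaging step in NumPy; the frame iterable and the predict_frame_probs callable (e.g. a forward pass through a pretrained VGG-16 with a softmax output) are illustrative assumptions, not the original pipeline code.

```python
import numpy as np

def video_object_probabilities(frames, predict_frame_probs):
    """Average per-frame object-class probabilities over a whole clip.

    frames: iterable of frames (e.g. HxWx3 arrays), one per video frame.
    predict_frame_probs: callable mapping one frame to a length-1000
        vector of softmax probabilities Pr(y = c | x_i; theta),
        e.g. a forward pass through a pretrained 16-layer CNN.
    """
    probs = np.stack([predict_frame_probs(f) for f in frames])  # (N, 1000)
    return probs.mean(axis=0)  # one 1000-dimensional descriptor per video
```

The averaging discards temporal ordering and keeps only the overall likelihood of each object class appearing somewhere in the excerpt.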
Classifiers.
To model the data we concentrated on two out-of-the-box ensemble techniques: Random Forests and AdaBoost. We chose these two techniques because, thanks to their combination of weak learners, they are less susceptible to overfitting than other learning algorithms, they are trivial to optimise as they have only one hyper-parameter, and they usually provide results close to or on par with the state of the art for a multitude of tasks [9, 10, 23, 14]. The hyper-parameters of each classifier were determined using a 5-fold cross-validation scheme on the development set. During development, the best performance was achieved with 10 trees for Random Forests and 20 trees for AdaBoost.
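The paper does not specify an implementation for these ensembles; as a minimal sketch, the single hyper-parameter (the number of trees/weak learners) could be tuned with 5-fold cross-validation using scikit-learn, with random placeholder data standing in for the development features and labels.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder development data: one feature vector and one
# arousal (or valence) label per movie excerpt.
X_dev = np.random.rand(200, 88)
y_dev = np.random.randint(0, 3, size=200)

for name, clf in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("AdaBoost", AdaBoostClassifier(random_state=0))]:
    # 5-fold cross-validation over the number of trees / weak learners.
    search = GridSearchCV(clf, {"n_estimators": [5, 10, 20, 50, 100]}, cv=5)
    search.fit(X_dev, y_dev)
    print(name, search.best_params_, round(search.best_score_, 3))
```

The same search would be run separately for the arousal and valence labels, keeping the configuration with the best cross-validated accuracy on the development set.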
Runs.
We submitted a total of five runs. Run 1 consisted of predictions made with the baseline features and the AdaBoost model. The predictions in Runs 2 and 5 were obtained using the baseline plus our audio-visual feature sets, with the Random Forest and AdaBoost classifiers, respectively. By looking at the distribution of labels in the development set, we observed that the most common combinations of labels are: 1) neutral valence (V^n) and negative arousal (A^−) (24%), and 2) positive valence (V^+) and negative arousal (A^−) (20%). Runs 3 and 4 are thus based on the hypothesis that the label distribution of the test set is similarly unbalanced: in Run 3 every clip was predicted as V^n, A^+, and in Run 4 every clip as V^+, A^−. These submissions act as a sanity check of our own models, as well as of other competitors' submissions.

2.2 Subtask 2: violence detection

Feature sets.
Following previous work [7, 13], we considered only spectral and energy based features as acoustic descriptors. Indeed, violent segments do not necessarily contain speech; voice-specific features, such as voice quality and pitch related descriptors, might thus not be a reliable source of information for violence. We extracted 22 acoustic low-level descriptors (LLDs) with the openSMILE toolkit [6]: loudness, alpha ratio, Hammarberg index, energy slope and energy proportion in the bands [0–500] Hz and [500–1500] Hz, and 14 MFCCs. All LLDs, with the exception of loudness and the measures of energy proportion, were computed separately for voiced and unvoiced segments. As the frames of a movie excerpt that contain violent scenes are unknown, we applied 5 functionals (maximum, minimum, range, arithmetic mean and standard deviation) to summarise the LLDs over the whole excerpt, which provided a total of 300 features; a sketch of this summarisation step is given below.

For the video modality, we used the same additional features as defined for Subtask 1. We also used the video genre metadata as an additional feature, owing to dependencies between movie genre and violent content.
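As a minimal illustration of the excerpt-level summarisation, the sketch below applies the five functionals to a frame-level LLD matrix; the array shapes and function name are assumptions for illustration and do not correspond to the exact openSMILE output format.

```python
import numpy as np

# The five functionals applied to each LLD contour over a movie excerpt.
FUNCTIONALS = {
    "max": np.max,
    "min": np.min,
    "range": lambda x: np.max(x) - np.min(x),
    "mean": np.mean,
    "std": np.std,
}

def summarise_llds(lld_matrix):
    """Summarise frame-level LLDs (n_frames x n_llds) into one vector
    of n_llds * 5 excerpt-level statistics."""
    return np.array([f(lld_matrix[:, j])
                     for j in range(lld_matrix.shape[1])
                     for f in FUNCTIONALS.values()])

# Example: 250 frames of 22 LLDs -> a 110-dimensional excerpt descriptor.
features = summarise_llds(np.random.rand(250, 22))
```

The 300-dimensional set used in the paper additionally separates voiced and unvoiced segments for most LLDs before applying the same functionals.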
Classifier.
Since the dataset is strongly imbalanced – only 272 excerpts out of 6,144 are labelled as violent – we up-sampled the violent instances to achieve a balanced distribution. All features were furthermore standardised with a z-score. As classifier, we used the libsvm implementation of Support Vector Machines (SVMs) [3] and optimised the complexity parameter and the γ coefficient of the radial basis function kernel in a 5-fold cross-validation framework on the development set. Because the official scoring script requires the computation of a posteriori probabilities, which is more time consuming than the straightforward classification task, we optimised the Unweighted Average Recall (UAR) to find the best hyper-parameters [19, 20], and then re-trained the SVMs with probability estimates.

Runs.
We first performed experiments with the full baseline feature set and found that adding the movie genre as a feature improved the Mean Average Precision (MAP) from 19.5 to 20.3, despite degrading the UAR from 72.3 to 72.0. Adding our own audio-visual features provided a jump in performance, with the MAP reaching 33.6 and the UAR 77.6. Because some movie excerpts contain only partly relevant acoustic information, we empirically defined a threshold on loudness, based on its histogram, to exclude frames before computing the functionals. This procedure improved the MAP to 35.9 but downgraded the UAR to 76.9. A fine tuning of the complexity parameter and γ coefficient yielded the best performance in terms of UAR, with a value of 78.0, but slightly deteriorated the MAP to 35.7.

We submitted a total of five runs. Run 1 – baseline features; Run 2 – all features mentioned above (except movie genre) with a loudness threshold of 0.038; Run 3 – same as Run 2 plus the inclusion of movie genre; Run 4 – as Run 3 but with fine tuning of the hyper-parameters; Run 5 – similar to Run 3 but with a higher loudness threshold (0.078).
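The loudness-based frame exclusion can be expressed as a simple mask applied to the LLD matrix before the functionals are computed; the sketch below assumes a per-frame loudness contour aligned with the LLD rows and uses the threshold of Run 2 only as an example value.

```python
import numpy as np

def exclude_quiet_frames(lld_matrix, loudness, threshold=0.038):
    """Keep only frames whose loudness reaches the empirical threshold.

    lld_matrix: (n_frames, n_llds) frame-level descriptors
    loudness:   (n_frames,) per-frame loudness contour
    """
    mask = loudness >= threshold
    # Fall back to all frames if the threshold would discard everything.
    return lld_matrix[mask] if mask.any() else lld_matrix
```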
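The classifier configuration described above (up-sampling of the violent class, z-score standardisation, RBF-kernel SVM with probability estimates, hyper-parameters selected by UAR) could be sketched as follows with scikit-learn, which wraps libsvm; the placeholder data and parameter grids are illustrative assumptions, not the values used for the submissions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import resample

# Placeholder development data: excerpt-level feature vectors and
# a heavily imbalanced violent/non-violent label distribution.
X = np.random.rand(500, 300)
y = np.random.choice([0, 1], size=500, p=[0.9, 0.1])  # 1 = violent

# Up-sample the violent class to obtain a balanced training set.
Xv, Xn = X[y == 1], X[y == 0]
Xv_up = resample(Xv, replace=True, n_samples=len(Xn), random_state=0)
X_bal = np.vstack([Xn, Xv_up])
y_bal = np.concatenate([np.zeros(len(Xn)), np.ones(len(Xv_up))])

# z-score standardisation + RBF SVM with probability estimates;
# tune C and gamma by 5-fold cross-validation, scored with UAR
# (macro-averaged recall).
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(pipe, grid, cv=5, scoring="recall_macro")
search.fit(X_bal, y_bal)
print(search.best_params_, round(search.best_score_, 3))
```

Optimising UAR during the grid search and only afterwards fitting the probability estimates keeps the (slower) probability calibration out of the hyper-parameter loop.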
3. RESULTS
Our official results on the test set for both subtasks are shown in Table 1.

         Subtask 1                      Subtask 2
Run   Arousal (AC)   Valence (AC)   Violence (MAP)
1        55.72          39.99            4.9
2        54.71          41.00           13.3
3        55.55          37.87           13.5
4        55.55          29.02           14.9
5        54.46          41.48           13.9

Table 1: Submission results for the arousal, valence, and violence classification tasks on the test partition. AC stands for accuracy and MAP for mean average precision.

Subtask 1. Our results for the affective task indicate that we did not perform much better than chance for arousal classification, and only slightly better than chance for valence in Run 5; we therefore refrain from further interpretation of these results. This can be explained by the low quality of the annotations provided with the dataset: the initial annotations had a low inter-rater agreement [2], and there were multiple processing stages afterwards [1, 22] with high levels of uncertainty and unclear validity.

Subtask 2. The results show substantial overfitting in our models, as the performance drops by a factor of 2 between the development and test partitions. This is, however, not really surprising, since only 272 instances labelled as violent were available as training data. Moreover, because the labelling was performed not at the frame level but at the excerpt level, the information that is judged as violent cannot be modelled precisely, making the task highly challenging. We can nevertheless observe that the proposed audio-visual feature set brings a large improvement over the baseline feature set – the MAP is improved by a factor greater than 2 – and that the inclusion of the movie genre as an additional feature also brings a small improvement in performance.

4. CONCLUSIONS
We have presented our approach to the MediaEval 2015 "Affective Impact of Movies" challenge, which consists in the automatic detection of affective and violent content in movie excerpts. Our results for the affective task show that we did not perform much better than a chance-level classifier, although we used features and classifiers that are known to work well in the literature for arousal and valence prediction [2, 8]. We consider that this might be due to the potential noisiness of the provided annotations. As for the violence prediction subtask, the results show that we overfit considerably on the development set, which is not very surprising given the small number of instances of the minority class. The analysis of violent content at the excerpt level is also highly challenging, because only a few frames might contain violence, and such brief information is almost totally lost in the computation of functionals over the full excerpt.

5. ACKNOWLEDGEMENTS
The research leading to these results has received funding from the EC's Seventh Framework Programme through the ERC Starting Grant No. 338164 (iHEARu), and the EU's Horizon 2020 Programme through the Innovative Action No. 644632 (MixedEmotions), No. 645094 (SEWA) and the Research Innovative Action No. 645378 (ARIA-VALUSPA).

6. REFERENCES
[1] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. From crowdsourced rankings to affective ratings. In Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW 2014), pages 1–6, 2014.
[2] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, January–March 2015.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), April 2011.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255, 2009.
[5] F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, and K. Truong. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, in press, 2015.
[6] F. Eyben, F. Weninger, F. Groß, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia (MM 2013), pages 835–838, Barcelona, Spain, October 2013.
[7] F. Eyben, F. Weninger, N. Lehment, and B. Schuller. Affective video retrieval: Violence detection in Hollywood movies by large-scale segmental feature extraction. PLOS ONE, 8(12):1–12, December 2013.
[8] K. Forbes-Riley and D. J. Litman. Predicting emotion in spoken dialogue from multiple knowledge sources. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (HLT-NAACL), pages 201–208, 2004.
[9] G. Fanelli, J. Gall, and L. Van Gool. Real time head pose estimation with random regression forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 617–624, Providence (RI), USA, June 2011.
[10] L. Guo, N. Chehata, C. Mallet, and S. Boukir. Relevance of airborne lidar and multispectral image data for urban scene classification using Random Forests. ISPRS Journal of Photogrammetry and Remote Sensing, 66(1):56–66, January 2011.
[11] A. Hanjalic and L.-Q. Xu. Affective video content representation and modeling. IEEE Transactions on Multimedia, 7(1):143–154, 2005.
[12] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank. A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 41(6):797–819, October 2011.
[13] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR 2013), pages 215–222, Dallas (TX), USA, 2013.
[14] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch. AdaBoost for text detection in natural scene. In Proceedings of the 12th IEEE International Conference on Document Analysis and Recognition (ICDAR 2013), pages 429–434, Beijing, China, 2013.
[15] I. Lopatovska and I. Arapakis. Theories, methods and current research on emotions in library and information science, information retrieval and human–computer interaction. Information Processing & Management, 47(4):575–592, July 2011.
[16] F. Ringeval, S. Amiriparian, F. Eyben, K. Scherer, and B. Schuller. Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion. In Proceedings of the 2nd Emotion Recognition In The Wild Challenge and Workshop (EmotiW 2014), pages 473–480, Istanbul, Turkey, September 2014.
[17] F. Ringeval, E. Marchi, M. Mehu, K. Scherer, and B. Schuller. Face reading from speech – predicting facial action units from audio cues. In Proceedings of INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association (ISCA), to appear, Dresden, Germany, September 2015.
[18] F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic. AV+EC 2015 – The first affect recognition challenge bridging across audio, video, and physiological data. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge (AVEC), ACM MM, Brisbane, Australia, October 2015.
[19] B. Schuller, S. Steidl, A. Batliner, J. Epps, F. Eyben, F. Ringeval, E. Marchi, and Y. Zhang. The INTERSPEECH 2014 Computational Paralinguistics Challenge: Cognitive & physical load. In Proceedings of INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association (ISCA), pages 427–431, Singapore, September 2014.
[20] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R. Orozco-Arroyave, E. Nöth, Y. Zhang, and F. Weninger. The INTERSPEECH 2015 Computational Paralinguistics Challenge: Nativeness, Parkinson's & eating condition. In Proceedings of INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association (ISCA), Dresden, Germany, September 2015.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] M. Sjöberg, Y. Baveye, H. Wang, V. Quand, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[23] A. Stumpf and N. Kerle. Object-oriented mapping of landslides using random forests. Remote Sensing of Environment, 115(10):2564–2577, October 2011.