TCS-ILAB - MediaEval 2015: Affective Impact of Movies and Violent Scene Detection

Rupayan Chakraborty, Avinash Kumar Maurya, Meghna Pandharipande, Ehtesham Hassan, Hiranmay Ghosh and Sunil Kumar Kopparapu
TCS Innovation Labs-Mumbai and Delhi, India
{rupayan.chakraborty, avinashkumar.maurya, meghna.pandharipande, ehtesham.hassan, hiranmay.ghosh, sunilkumar.kopparapu}@tcs.com

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, Wurzen, Germany

ABSTRACT
This paper describes the participation of TCS-ILAB in the MediaEval 2015 Affective Impact of Movies Task (which includes Violent Scene Detection). We propose to detect the affective impact and the violent content of video clips using two different classification methodologies, i.e. a Bayesian Network approach and an Artificial Neural Network approach. Experiments with different combinations of features make up the five run submissions.

1. SYSTEM DESCRIPTION

1.1 Bayesian network based valence, arousal and violence detection
We describe the use of a Bayesian network for the detection of violence/non-violence and of induced affect. Here, we learn the relationships between different attributes of different types of features using a Bayesian network (BN). Individual attributes such as Colorfulness, Shot length, or Zero Crossing form the nodes of the BN. These include the valence, arousal and violence labels, which are treated as categorical attributes. The primary objective of the BN based approach is to discover the cause-effect relationships between different attributes, which are otherwise difficult to learn using other learning methods. This analysis helps in gaining knowledge of the internal processes of feature generation with respect to the labels in question, i.e. violence, valence and arousal.

In this work, we have used a publicly available Bayesian network learner [1], which gives us the network structure describing the dependencies between the different attributes. Using the discovered structure, we compute the conditional probabilities for the root and its cause attributes. Further, we perform inference of the valence, arousal and violence values for new observations using the junction-tree algorithm supported in the Dlib-ml library [2].

As will be shown later, conditional probability computation is a relatively simple task for a network with few nodes, which is the case for the image features. However, as the attribute set grows, the number of parameters, namely the conditional probability tables, grows exponentially. Considering that our major focus is on determining the values of violence, valence and arousal with respect to unknown values of the other features, we apply the D-separation principle [3] to recursively prune the network, as it is not necessary to propagate information along every path in the network. This reduces the computational complexity significantly, both for parameter computation and for inference. Also, with the pruned network, we observe a reduced set of features that affect the values of the queried nodes.
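To make the parameter-computation step concrete, the following minimal Python sketch quantizes continuous attributes into ten uniform levels and estimates a conditional probability table for a label node given a learned parent set, then reads off the label distribution for a new observation. It is an illustration only: the attribute names, bin counts and the direct table lookup are assumptions, and the actual system uses the structure learner of [1] and the junction-tree inference of Dlib-ml [2] rather than this simplified stand-in.

import itertools
import numpy as np

def quantize(column, levels=10):
    """Discretize a continuous attribute into uniformly spaced levels."""
    lo, hi = column.min(), column.max()
    edges = np.linspace(lo, hi, levels + 1)[1:-1]
    return np.digitize(column, edges)  # integer codes in 0..levels-1

def estimate_cpt(data, node, parents, card):
    """Maximum-likelihood CPT P(node | parents) with add-one smoothing.

    data : dict of attribute name -> integer-coded 1-D array
    card : dict of attribute name -> number of discrete states
    Returns a dict mapping each parent-value tuple to a probability
    vector over the states of `node`.
    """
    cpt = {}
    for pa_vals in itertools.product(*[range(card[p]) for p in parents]):
        mask = np.ones(len(data[node]), dtype=bool)
        for p, v in zip(parents, pa_vals):
            mask &= data[p] == v
        counts = np.bincount(data[node][mask], minlength=card[node]) + 1.0
        cpt[pa_vals] = counts / counts.sum()
    return cpt

# --- hypothetical usage on a toy development set ------------------------
rng = np.random.default_rng(0)
dev = {
    "colorfulness": quantize(rng.random(1000)),   # 10 levels (assumed attribute)
    "shot_length":  quantize(rng.random(1000)),   # 10 levels (assumed attribute)
    "violence":     rng.integers(0, 2, 1000),     # binary label
}
card = {"colorfulness": 10, "shot_length": 10, "violence": 2}

# Parent set as it might be returned by the structure learner (assumed).
cpt_violence = estimate_cpt(dev, "violence", ["colorfulness", "shot_length"], card)

# Query P(violence | colorfulness=3, shot_length=7) for a new clip.
print(cpt_violence[(3, 7)])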
1.2 Artificial neural network based valence, arousal and violence detection
This section describes the system that uses Artificial Neural Networks (ANNs) for classification. Two different methodologies are employed for the two subtasks. For both subtasks, the developed systems extract the features from the video shots (including the audio) prior to classification.

1.2.1 Feature extraction
The proposed system uses different sets of features, either from the available feature set (audio, video, and image) provided with the MediaEval dataset, or from our own set of extracted audio features. The designed system uses either the audio, image, or video features separately, or a combination of them. The audio features are extracted with the openSMILE toolkit [4] from the audio extracted from the video shots. openSMILE computes low level descriptors (LLDs), followed by statistical functionals, to obtain a meaningful and informative set of audio features. The feature set contains the following LLDs: intensity, loudness, 12 MFCC, pitch (F0), voicing probability, F0 envelope, 8 LSF (Line Spectral Frequencies), and zero-crossing rate. Delta regression coefficients are computed from these LLDs, and the following functionals are applied to the LLDs and the delta coefficients: maximum and minimum value and their respective relative positions within the input, range, arithmetic mean, two linear regression coefficients with linear and quadratic error, standard deviation, skewness, kurtosis, quartiles, and three inter-quartile ranges. openSMILE, in two different configurations, allows the extraction of 988 and 384 audio features (the latter configuration was earlier used for the Interspeech 2009 Emotion Challenge [5]). Both of these are reduced to a lower dimension by feature selection.

1.2.2 Classification
For classification, we use an ANN trained with the development set samples available for each subtask. As data imbalance exists for the violence detection task (only 4.4% of the samples are violent), for training we take an equal number of samples from both classes. Therefore, we have multiple ANNs, each of them trained with a different set of data. During testing, the test sample is fed to all ANNs, the scores from all ANN outputs are summed using an add rule of combination, and the class with the maximum score is declared the winner, along with a confidence score. Moreover, while working with the test dataset, the above framework is used with different feature sets. For combining the outputs of the ANNs, two different methodologies are adopted. In the first, all the scores are added using an add rule before deciding on the detected class. In the second, the best neural network (selected based on the development set) is used for each feature set. Finally, the scores from all the best networks are summed and the decision is made on the maximum score.
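The score-level fusion described above can be sketched as follows: several networks are trained on balanced resamples of the imbalanced development data, and their per-class scores are summed with the add rule to pick the winning class together with a confidence value. This is a minimal sketch under assumed choices (scikit-learn's MLPClassifier, a single hidden layer, predicted class probabilities as scores, and synthetic data), since the paper does not specify the network architecture or training details.

import numpy as np
from sklearn.neural_network import MLPClassifier

def balanced_subsets(X, y, n_models, rng):
    """Yield balanced training sets: all minority-class samples plus an
    equally sized random draw from the majority class (different draw per model)."""
    minority, majority = (1, 0) if (y == 1).sum() < (y == 0).sum() else (0, 1)
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y == majority)[0]
    for _ in range(n_models):
        draw = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, draw])
        yield X[idx], y[idx]

def train_ensemble(X, y, n_models=19, seed=0):
    """Train one ANN per balanced subset (architecture is an assumption)."""
    rng = np.random.default_rng(seed)
    models = []
    for Xb, yb in balanced_subsets(X, y, n_models, rng):
        net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
        models.append(net.fit(Xb, yb))
    return models

def predict_add_rule(models, X_test):
    """Sum the per-class scores of all networks and take the maximum (add rule)."""
    total = sum(m.predict_proba(X_test) for m in models)
    confidence = total.max(axis=1) / total.sum(axis=1)
    return total.argmax(axis=1), confidence

# --- toy usage with synthetic, imbalanced data ---------------------------
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 13))             # e.g. 13-dim selected audio features
y = (rng.random(2000) < 0.044).astype(int)  # roughly 4.4% "violent" samples
models = train_ensemble(X, y, n_models=5)   # fewer models here than the 19 used in the paper
labels, conf = predict_add_rule(models, X[:10])
print(labels, conf)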
2. EXPERIMENTS
The BN is learned using only the features provided with the MediaEval 2015 development set [6][7]. As given, violence, valence, and arousal are categorical attributes, where violence is a binary variable, and valence and arousal each take three discrete states. For computing the prior probabilities, the remaining attributes of the complete development set are quantized into ten uniformly spaced levels. The pruned BNs obtained using the individual feature types are shown in Figure 1. In Figure 2, we show the BN obtained by merging the pruned BNs obtained from the individual features.

Figure 1: Pruned Bayesian Networks. (a) Image features; (b) Video features; (c) Audio features.
Figure 2: Pruned Bayesian Network with combined features.

The configurations of the five submitted runs (run1-run5) are the same for the two subtasks. The first two submissions (run1 and run2) are based on the BN, and the third and fourth (run3 and run4) are based on ANNs. The run5 results are obtained by random guessing based on the distribution of the samples in the development set. In run1, we created a BN with all features (image, video and audio) by merging the networks learned individually from the image, video and audio features on the complete development data. In run2, a BN is created without audio features by merging the networks learned individually from the image and video features on the complete development data. In run3, for the violence detection subtask, 19 different ANNs with openSMILE paralinguistic audio features (13 dimensional after feature selection) are trained. In run4, for the violence subtask, we trained 19 different ANNs with 5 different sets of features (41 dimensional MediaEval features, 20 dimensional MediaEval audio features, openSMILE audio features (7 dimensional after feature selection), openSMILE paralinguistic audio features (13 dimensional), and a combination of openSMILE audio and MediaEval video and image features), so we trained 19 * 5 = 95 ANNs in total. The best five ANN classifiers are selected on the development set, which is partitioned into 80% for training and 20% for testing. For the affective impact task, in run3 and run4, we trained several ANNs, each with a different feature set.

Table 1: MediaEval 2015 results on the test set.

         Affective impact (accuracy in %)    Violence detection (MAP)
         valence          arousal            violence
  run1   35.66            44.99              0.0638
  run2   34.27            43.88              0.0638
  run3   33.89            45.29              0.0459
  run4   29.79            48.95              0.0419
  run5   33.37            43.97              0.0553

Table 1 shows the results with the metrics proposed in MediaEval 2015 [6]. The best result for affective impact detection (48.95% accuracy) is obtained with run4 for arousal detection, which combines the best five neural networks for the five different feature sets. The best result for violence detection (a MAP of 0.0638) is obtained with run2, which uses a BN.

3. REFERENCES
[1] A. Shah and P. Woolf, "Python Environment for Bayesian Learning: Inferring the structure of Bayesian networks from knowledge and data," Journal of Machine Learning Research, vol. 10, pp. 159–162, June 2009.
[2] D. E. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
[3] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. The MIT Press, 2009.
[4] "openSMILE," 2015. [Online]. Available: http://www.audeering.com/research/opensmile
[5] B. W. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 Emotion Challenge," in INTERSPEECH, 2009, pp. 312–315.
[6] M. Sjoberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandrea, M. Schedl, C.-H. Demarty, and L. Chen, "The MediaEval 2015 Affective Impact of Movies Task," in MediaEval 2015 Workshop, 2015.
[7] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, "LIRIS-ACCEDE: A video database for affective content analysis," IEEE Transactions on Affective Computing, vol. 6, pp. 43–55, January 2015.