TCS-ILAB - MediaEval 2015: Affective Impact of Movies and Violent Scene Detection

Rupayan Chakraborty, Avinash Kumar Maurya, Meghna Pandharipande, Ehtesham Hassan, Hiranmay Ghosh and Sunil Kumar Kopparapu
TCS Innovation Labs-Mumbai and Delhi, India
{rupayan.chakraborty, avinashkumar.maurya, meghna.pandharipande, ehtesham.hassan, hiranmay.ghosh, sunilkumar.kopparapu}@tcs.com

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, Wurzen, Germany

ABSTRACT
This paper describes the participation of TCS-ILAB in the MediaEval 2015 Affective Impact of Movies Task (which includes Violent Scene Detection). We propose to detect the affective impact and the violent content of video clips using two different classification methodologies, i.e. a Bayesian Network approach and an Artificial Neural Network approach. Experiments with different combinations of features make up the five run submissions.

1. SYSTEM DESCRIPTION

1.1 Bayesian network based valence, arousal and violence detection
We describe the use of a Bayesian network for the detection of violence/non-violence and of induced affect. Here, we learn the relationships between different attributes of different types of features using a Bayesian network (BN). Individual attributes such as Colorfulness, Shot length, or Zero Crossing form the nodes of the BN. These include the valence, arousal and violence labels, which are treated as categorical attributes. The primary objective of the BN based approach is to discover the cause-effect relationships between different attributes, which are otherwise difficult to learn using other learning methods. This analysis helps in gaining knowledge of the internal processes of feature generation with respect to the labels in question, i.e. violence, valence and arousal.

In this work, we have used a publicly available Bayesian network learner [1], which gives us the network structure describing the dependencies between the different attributes. Using the discovered structure, we compute the conditional probabilities for the root and its cause attributes. Further, we perform inference of the valence, arousal and violence values for new observations using the junction-tree algorithm supported in the Dlib-ml library [2].

As will be shown later, conditional probability computation is a relatively simple task for a network with few nodes, which is the case for the image features. However, as the attribute set grows, the number of parameters, namely the conditional probability tables, grows exponentially. Considering that our major focus is on determining the values of violence, valence and arousal with respect to unknown values of the other features, we apply the D-separation principle [3] to recursively prune the network, as it is not necessary to propagate information along every path in the network. This reduces the computational complexity significantly, both for parameter computation and for inference. Also, with the pruned network, we observe a reduced set of features that affect the values of the queried nodes.
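To make the parameter-computation step concrete, the following minimal Python sketch quantizes continuous attributes into ten uniform levels and estimates a conditional probability table for a label node given a learned parent set, then reads off the label distribution for a new observation. It is an illustration only: the attribute names, bin counts and the direct table lookup are assumptions, and the actual system uses the structure learner of [1] and the junction-tree inference of Dlib-ml [2] rather than this simplified stand-in.

import itertools
import numpy as np

def quantize(column, levels=10):
    """Discretize a continuous attribute into uniformly spaced levels."""
    lo, hi = column.min(), column.max()
    edges = np.linspace(lo, hi, levels + 1)[1:-1]
    return np.digitize(column, edges)  # integer codes in 0..levels-1

def estimate_cpt(data, node, parents, card):
    """Maximum-likelihood CPT P(node | parents) with add-one smoothing.

    data : dict of attribute name -> integer-coded 1-D array
    card : dict of attribute name -> number of discrete states
    Returns a dict mapping each parent-value tuple to a probability
    vector over the states of `node`.
    """
    cpt = {}
    for pa_vals in itertools.product(*[range(card[p]) for p in parents]):
        mask = np.ones(len(data[node]), dtype=bool)
        for p, v in zip(parents, pa_vals):
            mask &= data[p] == v
        counts = np.bincount(data[node][mask], minlength=card[node]) + 1.0
        cpt[pa_vals] = counts / counts.sum()
    return cpt

# --- hypothetical usage on a toy development set ------------------------
rng = np.random.default_rng(0)
dev = {
    "colorfulness": quantize(rng.random(1000)),   # 10 levels (assumed attribute)
    "shot_length":  quantize(rng.random(1000)),   # 10 levels (assumed attribute)
    "violence":     rng.integers(0, 2, 1000),     # binary label
}
card = {"colorfulness": 10, "shot_length": 10, "violence": 2}

# Parent set as it might be returned by the structure learner (assumed).
cpt_violence = estimate_cpt(dev, "violence", ["colorfulness", "shot_length"], card)

# Query P(violence | colorfulness=3, shot_length=7) for a new clip.
print(cpt_violence[(3, 7)])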
1.2 Artificial neural network based valence, arousal and violence detection
This section describes the system that uses Artificial Neural Networks (ANNs) for classification. Two different methodologies are employed for the two subtasks. For both subtasks, the developed systems extract the features from the video shots (including the audio) prior to classification.

1.2.1 Feature extraction
The proposed system uses different sets of features, either from the available feature set (audio, video, and image) provided with the MediaEval dataset, or from our own set of extracted audio features. The designed system uses either the audio, image, or video features separately, or a combination of them. The audio features are extracted with the openSMILE toolkit [4] from the audio extracted from the video shots. openSMILE computes low level descriptors (LLDs), followed by statistical functionals, to obtain a meaningful and informative set of audio features. The feature set contains the following LLDs: intensity, loudness, 12 MFCC, pitch (F0), voicing probability, F0 envelope, 8 LSF (Line Spectral Frequencies), and zero-crossing rate. Delta regression coefficients are computed from these LLDs, and the following functionals are applied to the LLDs and the delta coefficients: maximum and minimum value and their respective relative positions within the input, range, arithmetic mean, two linear regression coefficients with linear and quadratic error, standard deviation, skewness, kurtosis, quartiles, and three inter-quartile ranges. openSMILE, in two different configurations, allows the extraction of 988 and 384 audio features (the latter configuration was earlier used for the Interspeech 2009 Emotion Challenge [5]). Both of these are reduced to a lower dimension by feature selection.

1.2.2 Classification
For classification, we use an ANN trained with the development set samples available for each subtask. As data imbalance exists for the violence detection task (only 4.4% of the samples are violent), for training we take an equal number of samples from both classes. Therefore, we have multiple ANNs, each of them trained with a different set of data. During testing, the test sample is fed to all ANNs, the scores from all ANN outputs are summed using an add rule of combination, and the class with the maximum score is declared the winner, along with a confidence score. Moreover, while working with the test dataset, the above framework is used with different feature sets. For combining the outputs of the ANNs, two different methodologies are adopted. In the first, all the scores are added using an add rule before deciding on the detected class. In the second, the best neural network (selected based on the development set) is used for each feature set. Finally, the scores from all the best networks are summed and the decision is made on the maximum score.
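The score-level fusion described above can be sketched as follows: several networks are trained on balanced resamples of the imbalanced development data, and their per-class scores are summed with the add rule to pick the winning class together with a confidence value. This is a minimal sketch under assumed choices (scikit-learn's MLPClassifier, a single hidden layer, predicted class probabilities as scores, and synthetic data), since the paper does not specify the network architecture or training details.

import numpy as np
from sklearn.neural_network import MLPClassifier

def balanced_subsets(X, y, n_models, rng):
    """Yield balanced training sets: all minority-class samples plus an
    equally sized random draw from the majority class (different draw per model)."""
    minority, majority = (1, 0) if (y == 1).sum() < (y == 0).sum() else (0, 1)
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y == majority)[0]
    for _ in range(n_models):
        draw = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, draw])
        yield X[idx], y[idx]

def train_ensemble(X, y, n_models=19, seed=0):
    """Train one ANN per balanced subset (architecture is an assumption)."""
    rng = np.random.default_rng(seed)
    models = []
    for Xb, yb in balanced_subsets(X, y, n_models, rng):
        net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
        models.append(net.fit(Xb, yb))
    return models

def predict_add_rule(models, X_test):
    """Sum the per-class scores of all networks and take the maximum (add rule)."""
    total = sum(m.predict_proba(X_test) for m in models)
    confidence = total.max(axis=1) / total.sum(axis=1)
    return total.argmax(axis=1), confidence

# --- toy usage with synthetic, imbalanced data ---------------------------
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 13))             # e.g. 13-dim selected audio features
y = (rng.random(2000) < 0.044).astype(int)  # roughly 4.4% "violent" samples
models = train_ensemble(X, y, n_models=5)   # fewer models here than the 19 used in the paper
labels, conf = predict_add_rule(models, X[:10])
print(labels, conf)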
2. EXPERIMENTS
The BN is learned using only the features provided with the MediaEval 2015 development set [6][7]. As given, violence, valence, and arousal are categorical attributes, where violence is a binary variable, and valence and arousal each take three discrete states. For computing the prior probabilities, the remaining attributes of the complete development set are quantized into ten uniformly spaced levels. The pruned BNs obtained using the individual feature types are shown in Figure 1. In Figure 2, we show the BN obtained by merging the pruned BNs obtained from the individual features.

Figure 1: Pruned Bayesian Networks. (a) Image features; (b) Video features; (c) Audio features.
Figure 2: Pruned Bayesian Network with combined features.

The configurations of the five submitted runs (run1-run5) are the same for the two subtasks. The first two submissions (run1 and run2) are based on the BN, and the third and fourth (run3 and run4) are based on ANNs. The run5 results are obtained by random guessing based on the distribution of the samples in the development set. In run1, we created a BN with all features (image, video and audio) by merging the networks learned individually from the image, video and audio features on the complete development data. In run2, a BN is created without audio features by merging the networks learned individually from the image and video features on the complete development data. In run3, for the violence detection subtask, 19 different ANNs with openSMILE paralinguistic audio features (13 dimensional after feature selection) are trained. In run4, for the violence subtask, we trained 19 different ANNs with 5 different sets of features (41 dimensional MediaEval features, 20 dimensional MediaEval audio features, openSMILE audio features (7 dimensional after feature selection), openSMILE paralinguistic audio features (13 dimensional), and a combination of openSMILE audio and MediaEval video and image features), so we trained 19 * 5 = 95 ANNs in total. The best five ANN classifiers are selected on the development set, which is partitioned into 80% for training and 20% for testing. For the affective impact task, in run3 and run4, we trained several ANNs, each with a different feature set.

Table 1: MediaEval 2015 results on the test set.

         Affective impact (accuracy in %)    Violence detection (MAP)
         valence          arousal            violence
  run1   35.66            44.99              0.0638
  run2   34.27            43.88              0.0638
  run3   33.89            45.29              0.0459
  run4   29.79            48.95              0.0419
  run5   33.37            43.97              0.0553

Table 1 shows the results with the metrics proposed in MediaEval 2015 [6]. The best result for affective impact detection (48.95% accuracy) is obtained with run4 for arousal detection, which combines the best five neural networks for the five different feature sets. The best result for violence detection (a MAP of 0.0638) is obtained with run2, which uses a BN.

3. REFERENCES
[1] A. Shah and P. Woolf, "Python Environment for Bayesian Learning: Inferring the structure of Bayesian networks from knowledge and data," Journal of Machine Learning Research, vol. 10, pp. 159–162, June 2009.
[2] D. E. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
[3] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. The MIT Press, 2009.
[4] "openSMILE," 2015. [Online]. Available: http://www.audeering.com/research/opensmile
[5] B. W. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 Emotion Challenge," in INTERSPEECH, 2009, pp. 312–315.
[6] M. Sjoberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandrea, M. Schedl, C.-H. Demarty, and L. Chen, "The MediaEval 2015 Affective Impact of Movies Task," in MediaEval 2015 Workshop, 2015.
[7] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, "LIRIS-ACCEDE: A video database for affective content analysis," IEEE Transactions on Affective Computing, vol. 6, pp. 43–55, January 2015.