<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TCS-ILAB - MediaEval 2015: Affective Impact of Movies and Violent Scene Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rupayan Chakraborty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avinash Kumar Maurya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meghna Pandharipande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ehtesham Hassan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiranmay Ghosh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunil Kumar Kopparapu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TCS Innovation Labs-Mumbai and Delhi</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper describes the participation of TCS-ILAB in the MediaEval 2015 Affective Impact of Movies Task (which includes Violent Scene Detection). We propose to detect the affective impact and the violent content of the video clips using two different classification methodologies, i.e. a Bayesian network approach and an artificial neural network approach. Experiments with different combinations of features make up the five run submissions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>SYSTEM DESCRIPTION</title>
      <sec id="sec-1-1">
        <title>Bayesian network based valence, arousal and violence detection</title>
        <p>We describe the use of a Bayesian network (BN) for the detection of violence/non-violence and of induced affect. Here, we learn the relationships between the attributes of different types of features using a BN. Individual attributes such as colorfulness, shot length, or zero-crossing rate form the nodes of the BN, together with the valence, arousal and violence labels, which are treated as categorical attributes. The primary objective of the BN-based approach is to discover the cause-effect relationships between attributes, which are otherwise difficult to learn using other learning methods. This analysis helps in gaining knowledge of the internal processes of feature generation with respect to the labels in question, i.e. violence, valence and arousal.</p>
        <p>
          In this work, we have used a publicly available Bayesian network learner [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which gives us the network structure describing the dependencies between the different attributes. Using the discovered structure, we compute the conditional probabilities for the root and its cause attributes. Further, we perform inference of the valence, arousal and violence values for new observations using the junction-tree algorithm supported in the Dlib-ml library [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          As will be shown later, conditional probability computation is a relatively simple task for a network with few nodes, which is the case for the image features. However, as the attribute set grows, the number of parameters, namely the conditional probability tables, grows exponentially. Considering that our main focus is on determining the values of violence, valence and arousal with respect to unknown values of the different features, we apply the D-separation principle [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to recursively prune the network, as it is not necessary to propagate information along every path in the network. This reduces the computational complexity significantly, both for parameter computation and for inference. Also, with the pruned network, we obtain a reduced set of features that affect the values of the queried nodes.
        </p>
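        <p>As an illustration of the parameter-estimation step, the following is a minimal sketch of how a conditional probability table for the violence label could be computed from the quantized development-set attributes. The attribute names, file layout and parent set are hypothetical placeholders; the actual system relied on the structure learner [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and on Dlib-ml [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for inference.</p>
        <preformat>
# Minimal sketch (assumptions noted above): estimate P(violence | parents)
# from attributes that have already been discretised into ten levels.
import pandas as pd

df = pd.read_csv("dev_set_discretised.csv")      # hypothetical file layout
parents = ["colorfulness", "shot_length"]        # assumed output of structure learning

# Relative frequency of each violence value within every parent configuration
counts = df.groupby(parents)["violence"].value_counts()
cpt = counts / counts.groupby(level=parents).transform("sum")

# Query the most probable violence value for one observed parent configuration
observation = (3, 7)                             # quantisation bins of the two parents
print(cpt.loc[observation].idxmax(), cpt.loc[observation].max())
        </preformat>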
      </sec>
      <sec id="sec-1-2">
        <title>Artificial neural network based valence, arousal and violence detection</title>
        <p>This section describes the system that uses Artificial Neural Networks (ANNs) for classification. Two different methodologies are employed for the two subtasks. For both subtasks, the developed systems extract features from the video shots (including the audio) prior to classification.</p>
        <sec id="sec-1-2-1">
        <title>Feature extraction</title>
        <p>
          The proposed system uses different sets of features, either from the available feature set (audio, video, and image) provided with the MediaEval dataset, or from our own set of extracted audio features. The designed system uses the audio, image and video features either separately or in combination. The audio features are extracted with the openSMILE toolkit [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] from the audio tracks of the video shots. openSMILE computes low-level descriptors (LLDs), followed by statistical functionals, to extract a meaningful and informative set of audio features. The feature set contains the following LLDs: intensity, loudness, 12 MFCCs, pitch (F0), voicing probability, F0 envelope, 8 LSF (line spectral frequencies), and zero-crossing rate. Delta regression coefficients are computed from these LLDs, and the following functionals are applied to the LLDs and the delta coefficients: maximum and minimum value and their relative positions within the input, range, arithmetic mean, two linear regression coefficients with linear and quadratic error, standard deviation, skewness, kurtosis, quartiles, and three inter-quartile ranges. openSMILE, in two different configurations, allows the extraction of 988 and 384 (the latter used earlier for the INTERSPEECH 2009 Emotion Challenge [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) audio features. Both of these are reduced to a lower dimension after feature selection. A sketch of the extraction step is given below.
        </p>
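        <p>The following is a minimal sketch of how the per-shot audio features could be extracted with openSMILE's SMILExtract command-line tool. The configuration file names (emobase for the 988-dimensional set, IS09_emotion for the 384-dimensional INTERSPEECH 2009 set) and the directory layout are assumptions about a local openSMILE installation, and the subsequent feature-selection step is not shown.</p>
        <preformat>
# Sketch (paths and config names are assumptions about the local setup)
import subprocess
from pathlib import Path

CONFIG = "opensmile/config/IS09_emotion.conf"    # 384-dimensional configuration

for wav in Path("shots_audio").glob("*.wav"):
    out = wav.with_suffix(".arff")
    # SMILExtract reads the audio file (-I) and appends one feature vector
    # per clip to the output file (-O) using the chosen configuration (-C).
    subprocess.run(["SMILExtract", "-C", CONFIG, "-I", str(wav), "-O", str(out)],
                   check=True)
        </preformat>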
      </sec>
        <sec id="sec-1-2-2">
        <title>Classification</title>
        <p>For classification, we have used ANNs that are trained with the development set samples available for each of the subtasks. As data imbalance exists for the violence detection task (only 4.4% of the samples are violent), for training we have taken an equal number of samples from both classes. Therefore, we have multiple ANNs, each of them trained with a different set of data: (a) image features, (b) video features, (c) audio features. While testing, the test sample is fed to all ANNs, the scores from all ANN outputs are summed using an add rule of combination, and the class with the maximum score is declared the winner, along with a confidence score. Moreover, while working with the test dataset, the above-mentioned framework is used with different feature sets. For combining the outputs of the ANNs, two different methodologies are adopted. In the first, all the scores are added using the add rule before deciding on the detected class. In the second, the best neural network (selected based on the development set) is used for each feature set. Finally, the scores from all the best networks are summed and the decision is made on the maximum score. A sketch of this fusion scheme follows.</p>
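        <p>A minimal sketch of the balanced training and add-rule fusion is given below, with scikit-learn's MLPClassifier standing in for the ANNs. The network sizes, the number of networks and the helper names are illustrative rather than the actual configuration.</p>
        <preformat>
# Sketch (assumptions noted above): train ANNs on class-balanced resamples
# and fuse their scores with the add rule described in the text.
import numpy as np
from sklearn.neural_network import MLPClassifier

def balanced_subset(X, y, rng):
    """Draw an equal number of violent (1) and non-violent (0) samples."""
    pos = np.flatnonzero(y == 1)
    neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

def train_ensemble(X, y, n_nets=19, seed=0):
    rng = np.random.default_rng(seed)
    nets = []
    for _ in range(n_nets):
        Xb, yb = balanced_subset(X, y, rng)
        nets.append(MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(Xb, yb))
    return nets

def predict_add_rule(nets, X_test):
    # Sum the per-class scores of all networks; the class with the maximum
    # summed score wins, and the normalised score serves as a confidence.
    scores = sum(net.predict_proba(X_test) for net in nets)
    return scores.argmax(axis=1), scores.max(axis=1) / len(nets)
        </preformat>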
      </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>EXPERIMENTS</title>
      <p>
        The BN is learned using only the features provided with the MediaEval 2015 development set [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. As given, violence, valence, and arousal are categorical attributes, where violence is a binary variable and valence and arousal take three discrete states. For computing the prior probabilities, the remaining attributes of the complete development set are quantized into ten uniformly spaced levels, as sketched below. The pruned BNs obtained using the individual features are shown in Figure 1. In Figure 2, we show the BN obtained by merging the pruned BNs obtained using the individual features.
      </p>
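      <p>A minimal sketch of the ten-level uniform quantization is shown below; the column names and file layout are placeholders for the MediaEval feature attributes.</p>
      <preformat>
# Sketch (placeholder names): quantize each continuous attribute into ten
# uniformly spaced levels before learning the BN.
import numpy as np
import pandas as pd

def quantize_uniform(series, n_levels=10):
    # Bin edges spaced uniformly between the observed minimum and maximum;
    # returns integer levels in the range 0..n_levels-1.
    edges = np.linspace(series.min(), series.max(), n_levels + 1)
    return np.digitize(series, edges[1:-1])

df = pd.read_csv("dev_set_features.csv")                  # hypothetical file
for col in df.columns.difference(["violence", "valence", "arousal"]):
    df[col] = quantize_uniform(df[col])
      </preformat>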
      <p>The configurations of the five submitted runs (run1-run5) are the same for the two subtasks. The first two run submissions (run1 and run2) are based on the BN, the third and fourth (run3 and run4) are based on ANNs, and the run5 results are obtained by random guessing based on the distribution of the samples in the development set. In run1, we have created a BN with all features (image, video and audio) by merging the networks learned individually using image features, video features and audio features, respectively, on the complete development data. In run2, a BN is created without audio features by merging the networks learned individually using image and video features on the complete development data. In run3, for the violence detection subtask, 19 different ANNs with openSMILE paralinguistic audio features (13-dimensional after feature selection) are trained. In run4, for the violence detection subtask, we have trained 19 different ANNs with 5 different sets of features (41-dimensional MediaEval features, 20-dimensional MediaEval audio features, openSMILE audio features (7-dimensional after feature selection), openSMILE paralinguistic audio features (13-dimensional), and a combination of openSMILE audio and MediaEval video and image features), so we have trained 19 × 5 = 95 ANNs. The best five ANN classifiers are selected while working on the development set, which is partitioned into 80% for training and 20% for testing. For the affective impact subtask, in run3 and run4 we have trained several ANNs, each with a different feature set.</p>
      <p>
        Table 1 shows the results with the metrics proposed in MediaEval 2015 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The best result for affective impact detection (48.95% accuracy) is obtained with run4 for arousal detection, which combines the best five neural networks for the five different feature sets. The best result for violence detection (0.0638 MAP) is obtained in run2, which uses a BN.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Woolf</surname>
          </string-name>
          , "
          <article-title>Python environment for Bayesian Learning: Inferring the structure of Bayesian Networks from knowledge and data,"</article-title>
          <source>Journal of Machine Learning Research</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>159</fpage>
          –
          <lpage>162</lpage>
          ,
          <year>June 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>King</surname>
          </string-name>
          , "
          <article-title>Dlib-ml: A machine learning toolkit,"</article-title>
          <source>Journal of Machine Learning Research</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>1755</fpage>
          –
          <lpage>1758</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Koller</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <source>Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning</source>
          . The MIT Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] "openSMILE,"
          <year>2015</year>
          . [Online]. Available: http://www.audeering.com/research/opensmile
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steidl</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Batliner</surname>
          </string-name>
          , "
          <article-title>The INTERSPEECH 2009 Emotion Challenge,"</article-title>
          in
          <source>INTERSPEECH</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>312</fpage>
          –
          <lpage>315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          , "
          <article-title>The MediaEval 2015 Affective Impact of Movies task,"</article-title>
          in
          <source>MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          , "
          <article-title>LIRIS-ACCEDE: A video database for affective content analysis,"</article-title>
          <source>IEEE Transactions on Affective Computing</source>
          , vol.
          <volume>6</volume>
          , pp.
          <fpage>43</fpage>
          –
          <lpage>55</lpage>
          ,
          <year>January 2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>