<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BUL in MediaEval 2016 Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asim Jan</string-name>
          <email>asim.jan@brunel.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yona Falinie A. Gaus</string-name>
          <email>yonafalinie.abdgaus@brunel.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fan Zhang</string-name>
          <email>fan.zhang@brunel.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongying Meng</string-name>
          <email>hongying.meng@brunel.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electronic and Computer Engineering, Brunel University</institution>
          ,
          <addr-line>London</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper describes our working approach for the Emotional Impact of Movies task of MediaEval 2016. There are 2 sub-tasks set to make affective predictions, based on Arousal and Valence values, on video clips. Sub-task 1 requires global emotion prediction. Here a framework is developed using Deep Auto-Encoders, a feature variation algorithm and a Deep Network. For Sub-task 2, a set of audio features is extracted for continuous emotion prediction. Both sub-tasks are approached as a regression problem evaluated by Mean Squared Error and Pearson Correlation Coefficient.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The 'Emotional Impact of Movies Task' comprises two
sub-tasks with the goal of creating a system that
automatically predicts the emotional impact of video content in
terms of Arousal and Valence, which together form a 2-D scale
that can be used to describe emotions. Sub-task 1 - Global emotion
prediction: predicting a score of induced Valence
(negative-positive) and induced Arousal (calm-excited) for a whole
clip. Sub-task 2 - Continuous emotion prediction: predicting
a score of induced Arousal and Valence continuously for
each 1s segment of a video. The development dataset used
in both tasks is the LIRIS-ACCEDE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For the first
sub-task, 9800 video excerpts (around 10 s each) are provided with
global Valence and Arousal annotations. For the second
sub-task, 30 movies are provided with continuous
annotations of Valence and Arousal. Full details on the challenge
tasks and database can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Framework Summary for Sub-task 1</title>
      <p>
        The framework is primarily based on visual cues, with
the use of Deep Learning to benefit from the large sample
video dataset. The videos contain many different
scenes, making the emotion detection process
challenging. To tackle this, a 3-stage framework has been designed.
The first stage of the framework is a Deep Auto-Encoder,
which can be understood in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is utilized to try and
recreate a common representation through a Deep Network. It is
trained with all video samples, and each image is reproduced
at frame level to try and obtain representations common
to all the videos. This is likely to highlight
people's faces and suppress uncommon scenes and objects.
The second stage observes the decoded features for
variations within a video sample by using the Feature Dynamic
History Histogram (FDHH) across the frame level, producing
a histogram of patterns that summarizes and captures these
observations from a set of features. Finally, the FDHH
features are used with a regressive model to predict the Arousal
and Valence scales.
      </p>
      <sec id="sec-3-1">
        <title>2.1.1 Stage 1 - Auto-Encoder</title>
        <p>
          There are two Deep Auto-Encoders trained, one based on
MSE loss and the other on Euclidean loss. Using an
architecture similar to Fig. 1 of [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], both Auto-Encoders
have the same architecture of 4 convolution (Conv) layers
followed by 4 deconvolution (DeConv) layers. Each of the
Conv and DeConv layers is followed by a Rectified Linear
Unit (ReLU) activation layer, and at the end is a loss layer.
        </p>
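        <p>As a rough illustration of this stage only, the sketch below builds a 4-Conv / 4-DeConv encoder-decoder with ReLU activations and an MSE reconstruction loss. The framework (PyTorch) and the layer widths are assumptions made for illustration; the paper does not specify them.</p>
        <preformat><![CDATA[
# Hedged sketch of a 4 Conv / 4 DeConv auto-encoder with ReLU and MSE loss.
# Framework (PyTorch) and layer widths are assumptions, not values from the paper.
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 4 convolution layers, each followed by ReLU.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: 4 deconvolution (transposed convolution) layers, each followed by ReLU.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoEncoder()
loss_fn = nn.MSELoss()                     # the Euclidean-loss variant swaps this loss
frame = torch.randn(1, 3, 64, 64)          # one dummy RGB frame
loss = loss_fn(model(frame), frame)        # reconstruction error drives training
]]></preformat>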
      </sec>
      <sec id="sec-3-2">
        <title>2.1.2 Stage 2 - FDHH Feature</title>
        <p>
          The FDHH algorithm, based on the idea of the Motion
History Histogram (MHH) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], aims to extract temporal
movement across the feature space. This is achieved firstly by
taking the absolute difference of a feature vector V(n, 1:c)
representing a frame and the following frame V(n+1, 1:c) to
produce D(n, 1:c), where n is the frame sample and c is
the feature dimension. Next, each dimension of the vector D is
compared to a threshold T that is set by the user to control
the amount of variation to detect, producing a vector of 1's
and 0's that represent above and below the threshold. This is
repeated for all frames except the last frame, and a new
feature set F(1:N-1, 1:C) is produced. Next, each dimension c
is observed for patterns m = 1:M throughout the feature vector
F(1:N-1, c), and a histogram is produced for each defining
pattern. A pattern can be defined as the number of consecutive
1's, e.g. m = 1 would look for the pattern '010', and m = 2
would look for '0110'. The final FDHH feature will be of
dimensions FDHH(1:M, 1:C).
        </p>
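        <p>A minimal sketch of the FDHH computation as described above is given below (Python/NumPy assumed). The function name, the threshold T and the maximum pattern length M are illustrative placeholders, as is the choice to clip runs longer than M into the last bin.</p>
        <preformat><![CDATA[
# Minimal NumPy sketch of the FDHH descriptor as described above.
# Function name, threshold T and M are illustrative choices, not values from the paper.
import numpy as np

def fdhh(V, M=5, T=0.05):
    """V: (N, C) frame-level feature matrix. Returns an (M, C) histogram."""
    N, C = V.shape
    # Absolute difference between consecutive frames, thresholded to a binary map.
    D = np.abs(V[1:] - V[:-1])               # shape (N-1, C)
    F = (D > T).astype(np.uint8)             # 1 where variation exceeds T, else 0
    H = np.zeros((M, C), dtype=np.int64)
    for c in range(C):
        run = 0
        for value in F[:, c]:
            if value == 1:
                run += 1                     # extend the current run of consecutive 1's
            elif run > 0:
                # a run of length m votes into bin m (runs longer than M are clipped here)
                H[min(run, M) - 1, c] += 1
                run = 0
        if run > 0:                          # close any run still open at the end
            H[min(run, M) - 1, c] += 1
    return H
]]></preformat>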
      </sec>
      <sec id="sec-3-3">
        <title>2.1.3 Stage 3 - Regression Models</title>
        <p>The final stage of the framework is the regression model,
of which two are utilized. The first is a Deep Network trained
on the FDHH features using MSE and Euclidean regression
loss functions; the other treats this trained Deep Network
as a pre-trained feature extractor and applies Partial Least
Squares (PLS) regression on the features to predict the Arousal and
Valence values.</p>
        <p>
          The Deep Network consists of 9 Conv layers, 1 Pooling
layer, 8 ReLU activation layers and a loss layer at the
end. Training is done using Stochastic Gradient Descent for
100 epochs, with the weights initialized using the Xavier
method [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
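        <p>For concreteness, a hedged sketch of how such a regression network could be set up and trained follows (PyTorch assumed). The 9-Conv architecture is abbreviated, and the layer widths, learning rate and data are placeholders, not values from the paper.</p>
        <preformat><![CDATA[
# Hedged sketch: Xavier-initialised regression network trained with SGD and MSE loss.
# The 9-Conv architecture is abbreviated; widths, lr and data are placeholders.
import torch
import torch.nn as nn

def xavier_init(module):
    # Xavier/Glorot initialisation [5] applied to Conv and Linear weights.
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)

net = nn.Sequential(
    nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv1d(16, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 2),                        # two outputs: Arousal and Valence
)
net.apply(xavier_init)

optimiser = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

fdhh_batch = torch.randn(8, 1, 500)          # 8 dummy FDHH vectors (length M*C assumed 500)
labels = torch.randn(8, 2)                   # dummy Arousal/Valence targets
for epoch in range(100):                     # trained for 100 epochs
    optimiser.zero_grad()
    loss = loss_fn(net(fdhh_batch), labels)
    loss.backward()
    optimiser.step()
]]></preformat>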
        <p>The pre-trained features are extracted from the 100th
epoch network and are concatenated with the Audio
descriptors mentioned in Section 2.2.1, except that they are
extracted from whole video clips rather than 1s segments.
These features are concatenated, rank-normalized between
0 and 1, and then used with PLS regression.</p>
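        <p>A possible reading of this fusion step is sketched below (Python with scikit-learn and SciPy assumed). The feature dimensions and the number of PLS components are placeholders, and the data is random for illustration.</p>
        <preformat><![CDATA[
# Sketch of the fusion step: concatenate deep and audio features, rank-normalise
# each dimension to [0, 1], then fit PLS regression. Dimensions are placeholders.
import numpy as np
from scipy.stats import rankdata
from sklearn.cross_decomposition import PLSRegression

def rank_normalise(X):
    """Map each feature column to [0, 1] according to the rank of its values."""
    ranks = np.apply_along_axis(rankdata, 0, X)
    return (ranks - 1) / (X.shape[0] - 1)

deep_feats = np.random.rand(100, 256)     # hypothetical pre-trained deep features
audio_feats = np.random.rand(100, 384)    # clip-level openSMILE descriptors (Sec. 2.2.1)
labels = np.random.rand(100, 2)           # Arousal and Valence

X = rank_normalise(np.hstack([deep_feats, audio_feats]))
pls = PLSRegression(n_components=10).fit(X, labels)
predictions = pls.predict(X)
]]></preformat>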
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Framework Summary for Sub-task 2</title>
      <sec id="sec-4-1">
        <title>2.2.1 Stage 1 - Audio Descriptors</title>
        <p>
          The Audio descriptors are extracted using the openSMILE
software [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and include 16 low-level descriptors (LLDs) as
follows: root mean square (RMS) frame energy;
zero-crossing rate (ZCR); harmonics-to-noise ratio (HNR); pitch frequency
(F0); and mel-frequency cepstral coefficients (MFCC) 1-12. For
each LLD, 12 functionals are also computed: mean, standard deviation,
kurtosis, skewness, minimum and maximum value, relative
position, and range, as well as two linear regression coefficients
with their mean square error (MSE). In
total, the number of features per 1s segment is (16 x 2 x
12) = 384 attributes.
        </p>
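        <p>As an illustration of the functional-statistics idea (openSMILE computes the full set itself), the sketch below summarizes the frame-level LLDs of one 1s segment with a subset of the functionals; the array shapes and frame count are hypothetical.</p>
        <preformat><![CDATA[
# Hedged sketch: turn frame-level LLDs of one 1s segment into functional statistics.
# Only a subset of the 12 functionals is shown; openSMILE produces the full 384-d set.
import numpy as np
from scipy.stats import kurtosis, skew

def segment_functionals(lld_frames):
    """lld_frames: (num_frames, num_llds) array for a single 1s segment."""
    funcs = [
        lld_frames.mean(axis=0),
        lld_frames.std(axis=0),
        kurtosis(lld_frames, axis=0),
        skew(lld_frames, axis=0),
        lld_frames.min(axis=0),
        lld_frames.max(axis=0),
        lld_frames.max(axis=0) - lld_frames.min(axis=0),   # range
    ]
    return np.concatenate(funcs)             # one flat descriptor per segment

segment = np.random.randn(100, 16)           # e.g. 100 frames x 16 LLDs
descriptor = segment_functionals(segment)
]]></preformat>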
      </sec>
      <sec id="sec-4-2">
        <title>2.2.2 Stage 2 - Regression Models</title>
        <p>A total of 3 regression models are trained on the audio
descriptors. These are described in the following runs:</p>
        <p>Run 1 - Linear Regression + Gaussian smoothing
(LR+Gs): After obtaining the predicted labels from the
regression stage, a smoothing operation is performed using
Gaussian filtering with a window size of 10. The smoothing
window is carefully selected in order to retain the pattern
of the labels whilst increasing the performance. It is
required for removing high-frequency noise irrelevant to
the affective dimensions.</p>
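        <p>A minimal sketch of this run, assuming scikit-learn and SciPy, is given below; the data and the Gaussian sigma are placeholders (the paper specifies a window size of 10 rather than a sigma).</p>
        <preformat><![CDATA[
# Hedged sketch of Run 1: linear regression per affective dimension followed by
# Gaussian smoothing of the 1s-resolution predictions. Data and sigma are placeholders.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.linear_model import LinearRegression

X_train = np.random.rand(500, 384)           # 384-d audio descriptors per 1s segment
y_train = np.random.rand(500)                # e.g. Arousal labels
X_test = np.random.rand(200, 384)

raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
smoothed_pred = gaussian_filter1d(raw_pred, sigma=2)   # removes high-frequency noise
]]></preformat>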
        <p>
          Run 2 - Partial Least Squares (PLS): PLS is a
statistical algorithm that bears some relation to principal
components regression. Previous EmotiW 2015 systems employed PLS,
which gave better results than the baseline [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Run 3 - Least Squares Boosting + Moving
Average smoothing (LSB + MAs): LSB is a regression model
trained with gradient boosting [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In this model, the
number of regression trees in the ensemble is chosen as 500 on the
training set. After obtaining the prediction labels, a
smoothing operation is performed using a moving average filter, in
order to increase the performance.
        </p>
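        <p>A hedged sketch of this run with scikit-learn's least-squares gradient boosting and a simple moving-average filter follows; the data and the smoothing window length are placeholders.</p>
        <preformat><![CDATA[
# Hedged sketch of Run 3: least-squares gradient boosting with 500 trees, then a
# moving-average filter over the continuous predictions. Window length is a placeholder.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X_train = np.random.rand(500, 384)
y_train = np.random.rand(500)
X_test = np.random.rand(200, 384)

# "squared_error" is the least-squares loss (named "ls" in older scikit-learn releases).
lsb = GradientBoostingRegressor(loss="squared_error", n_estimators=500)
raw_pred = lsb.fit(X_train, y_train).predict(X_test)
smoothed_pred = np.convolve(raw_pred, np.ones(10) / 10, mode="same")  # moving average
]]></preformat>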
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTAL SETUP</title>
      <p>5 different runs were made based on the framework for
Sub-task 1, and 3 different runs for Sub-task 2, which are:</p>
      <p>Run 1 &amp; Run 2 - Deep + Audio + PLS: These runs
utilize trained Auto-Encoders with a Euclidean loss function
(EUC Loss) and an MSE loss function (MSE Loss), followed by
FDHH feature extraction. Finally, the trained Deep Networks,
with Euclidean and MSE loss respectively, are used as
Pre-Trained Feature Extractors. These features are fused with
Audio features and a PLS regression model is trained.</p>
      <p>Run 3 - Audio + PLS: This run is based on using just the
openSMILE Audio descriptors, along with a PLS regression model.</p>
    </sec>
    <sec id="sec-6">
      <title>4. RESULTS AND DISCUSSIONS</title>
      <p>On the official test results, each of the sub-tasks was
evaluated using Mean Squared Error (MSE) and Pearson
Correlation Coefficient (PCC). For Sub-task 1 (Table 1), the results
show a strong performance for Run 5, using a trained Deep
Network with Euclidean loss and no Audio descriptors. It is
closely matched by Run 4, the identical configuration using
MSE loss to train the Deep Networks. The Audio
descriptors have shown the weakest performance of all, with
the possibility of increasing the errors of Runs 1 and 2, as
they use Audio fusion. In terms of training loss functions
(EUC vs MSE), when comparing Runs 1 vs 2 and Runs
4 vs 5, there is a performance boost for Euclidean loss in
most cases, but only a marginal one. For Sub-task 2 (Table 2),
PLS gives the lowest MSE but LR+Gs gives the highest results on
PCC. However, all algorithms perform unacceptably badly on
Valence, a situation that requires further investigation.</p>
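      <p>For reference, the two evaluation measures can be computed as in the short sketch below (NumPy and SciPy assumed; the values are toy data, not results from the task).</p>
      <preformat><![CDATA[
# Sketch of the two evaluation measures used for both sub-tasks, on toy data.
import numpy as np
from scipy.stats import pearsonr

y_true = np.array([0.2, 0.4, 0.1, 0.5])     # toy ground-truth scores
y_pred = np.array([0.3, 0.35, 0.2, 0.45])   # toy predictions

mse = np.mean((y_true - y_pred) ** 2)       # Mean Squared Error
pcc, _ = pearsonr(y_true, y_pred)           # Pearson Correlation Coefficient
]]></preformat>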
    </sec>
    <sec id="sec-7">
      <title>5. CONCLUSIONS</title>
      <p>In this working notes paper, we proposed a different
framework for each sub-task. The frameworks are composed of
feature extraction using deep learning, FDHH for capturing
feature variations across the Deep Features at frame level,
and finally audio descriptors taken from the speech signal.
Several machine learning algorithms were also implemented
as regression models. The official test results show that the
features produced by the framework are informative, giving good
results in terms of MSE and PCC in Sub-task 1 and good
results in terms of MSE in Sub-task 2. Future work will
focus on the dynamic relationship of the emotion data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          .
          <article-title>Autoencoders, unsupervised learning, and deep architectures</article-title>
          .
          <source>In Unsupervised and Transfer Learning - Workshop held at ICML</source>
          <year>2011</year>
          , Bellevue, Washington, USA, July
          <volume>2</volume>
          ,
          <year>2011</year>
          , pages
          <fpage>37</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen.</surname>
          </string-name>
          <article-title>LIRIS-ACCEDE: A video database for affective content analysis</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>43</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          , M. Sjoberg, C. Chamaret, and
          <string-name>
            <given-names>E. C. D.</given-names>
            <surname>Lyon</surname>
          </string-name>
          .
          <source>The MediaEval 2016 Emotional Impact of Movies Task. pages 3{5</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Stochastic gradient boosting</article-title>
          .
          <source>Computational Statistics and Data Analysis</source>
          ,
          <volume>38</volume>
          (
          <issue>4</issue>
          ):
          <fpage>367</fpage>
          -
          <lpage>378</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          .
          <source>In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS10)</source>
          .
          <source>Society for Artificial Intelligence and Statistics</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kaya</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Salah</surname>
          </string-name>
          .
          <article-title>Contrasting and Combining Least Squares Based Learners for Emotion Recognition in the Wild</article-title>
          , pages
          <fpage>459</fpage>
          -
          <lpage>466</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Partial least squares regression on grassmannian manifold for emotion recognition</article-title>
          .
          <source>Proceedings of the 15th ACM on International conference on multimodal interaction - ICMI '13</source>
          , pages
          <fpage>525</fpage>
          -
          <lpage>530</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pears</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freeman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bailey</surname>
          </string-name>
          .
          <article-title>Motion history histograms for human action recognition</article-title>
          .
          <source>In Embedded Computer Vision</source>
          , pages
          <fpage>139</fpage>
          -
          <lpage>162</lpage>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          .
          <article-title>open-Source Media Interpretation by Large feature-space Extraction</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Arpit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Nwogu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          .
          <article-title>Is Joint Training Better for Deep Auto-Encoders?</article-title>
          .
          <source>ArXiv e-prints</source>
          ,
          <year>May 2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>