<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>THUHCSI in MediaEval 2018 Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ye Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xihao Liang</string-name>
          <email>liangxh16@mails.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingxing Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science &amp; Technology, Tsinghua University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In this paper we describe our team's approach to the MediaEval 2018 Emotional Impact of Movies task. We extract several sets of audio and visual features and apply time-sequential models such as LSTM and BLSTM to model the continuous flow of emotion in movies. Different fusion methods are also considered and discussed. The results show that our methods achieve promising performance, indicating the effectiveness of the chosen features and models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The Emotional Impact of Movies task of MediaEval has been held since 2015 [<xref ref-type="bibr" rid="ref1 ref2 ref9">1, 2, 9</xref>]. The task focuses on the emotions that movies evoke in viewers and on how to predict them. This year's task consists of two subtasks: Subtask 1 addresses valence and arousal prediction, and Subtask 2 addresses fear prediction. Details of both subtasks can be found in [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>In this section, we describe in detail our team’s main approach,
including feature extraction, prediction models, fusion methods,
pre-processing and post-processing.</p>
    </sec>
    <sec id="sec-3">
      <title>Feature extraction</title>
      <p>
        Audio features. Previous results [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ] have shown the great
potential of the extended Geneva Minimalistic Acoustic
Parameter Set (eGeMAPS) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This feature set contains 23 low-level
descriptors (LLDs) and has proved effective in acoustic tasks such as
speech emotion recognition. In our experiments, we extract the
low-level descriptors of eGeMAPS using the openSMILE toolbox
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We then compute the mean and standard deviation of all 23
descriptors in a centered 5-second sliding window, which yields a
46-dimensional feature vector for each second of the movie clip.
      </p>
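      <p>The windowed statistics described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the low-level descriptors have already been exported by openSMILE as a list of frames, and the function name <monospace>window_stats</monospace>, the <monospace>frames_per_second</monospace> parameter, the use of the population standard deviation and the truncation of the window at clip boundaries are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
from statistics import fmean, pstdev


def window_stats(llds, frames_per_second, window_seconds=5):
    """Per-second mean and standard deviation of each descriptor.

    llds: list of frames, each frame a list of descriptor values
          (23 values per frame in the paper's setting).
    Returns one vector per second with 2 * n_descriptors entries:
    all means first, then all standard deviations.
    """
    n_frames = len(llds)
    n_dims = len(llds[0])
    n_seconds = n_frames // frames_per_second
    half = window_seconds / 2.0
    features = []
    for sec in range(n_seconds):
        centre = sec + 0.5  # middle of the current second
        # Window centred on the current second, truncated at clip boundaries.
        start = max(0, int(round((centre - half) * frames_per_second)))
        stop = min(n_frames, int(round((centre + half) * frames_per_second)))
        window = llds[start:stop]
        means = [fmean(frame[d] for frame in window) for d in range(n_dims)]
        stds = [pstdev([frame[d] for frame in window]) for d in range(n_dims)]
        features.append(means + stds)
    return features
```
]]></preformat>
      <p>With 23 descriptors per frame, each returned vector has 46 entries, matching the dimensionality stated above.</p>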
      <p>We also consider the baseline audio features provided by the
organizers, namely the Emobase 2010 feature set (1582 dimensions).</p>
      <p>Visual features. The baseline features consist of multiple
general-purpose visual features. Following last year’s experiments, we
concatenate all visual features except the CNN feature into a single
1271-dimensional vector. The CNN feature is treated
separately from the other features because it is much larger (4096
dimensions) and comes from a different source.</p>
      <p>
        To exploit more visual information, we also try
SentiBank for feature extraction. We apply the MVSO detectors [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to
image frames extracted once per second from the movies and take
the output of the final layer of the Inception network, which can be
interpreted as the composition ratio of different concepts (4342 dimensions).
      </p>
      <p>All features are standardized to zero mean and unit variance.</p>
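      <p>A minimal sketch of this standardization is given below. The function names <monospace>fit_scaler</monospace> and <monospace>transform</monospace> are hypothetical, and two details are assumptions not stated in the paper: the statistics are estimated once (for example on the training data) and then reused, and constant dimensions are only centred to avoid division by zero.</p>
      <preformat><![CDATA[
```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Estimate per-dimension mean and standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Guard against constant dimensions (assumption: leave them centred only).
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Scale every feature vector to zero mean and unit variance."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```
]]></preformat>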
    </sec>
    <sec id="sec-4">
      <title>Prediction models</title>
      <p>
        Last year’s results [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] showed that Support Vector Machines
(SVM) outperformed Long Short-Term Memory (LSTM) models.
However, the training set is larger this year than last year, and
time-sequential models should benefit from more data, so this year
we adopt LSTM as the prediction model
for the emotional flow. Specifically, we treat the task as
a sequence-to-sequence problem, and the length of the input
sequences is chosen on the validation set.
      </p>
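      <p>To make the sequence-to-sequence framing concrete, the sketch below cuts the per-second features and targets of one movie into fixed-length sequences. The paper does not say how sequences are cut, so the non-overlapping split, the shorter final sequence and the name <monospace>split_sequences</monospace> are assumptions for illustration only.</p>
      <preformat><![CDATA[
```python
def split_sequences(features, targets, seq_len):
    """Cut one movie into consecutive (input, target) sequences of seq_len seconds.

    The final sequence may be shorter; padding and overlap are left out here.
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must cover the same seconds")
    return [
        (features[i:i + seq_len], targets[i:i + seq_len])
        for i in range(0, len(features), seq_len)
    ]
```
]]></preformat>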
      <p>This year we also use the Bidirectional LSTM (BLSTM), for
two reasons. First, the ground truth is labelled
while the annotators are watching the movies, so the latency and
mismatch between the annotations and the movie content must be
taken into account; a bidirectional model sees both past and future context.
Second, the emotional flow in movies changes smoothly, so
a Bidirectional LSTM could be less affected by fluctuations in the
input features.</p>
      <p>Another difference from last year is that we train models
for valence and arousal jointly. Since valence
and arousal describe related aspects of emotion, it is reasonable to use
the same underlying structure for both. Therefore, every regression model
is trained to predict a two-dimensional vector that represents
both valence and arousal.</p>
      <p>For Subtask 2, the experiments are carried out in two steps for
simplicity. First, we train a classification model to predict a
label for every second. Second, we identify a segment as "Fear"
according to the labels of the seconds within it. Specifically, we filter
out the seconds whose probability of evoking fear is lower than
a threshold and keep only the runs of consecutive seconds that are
longer than a second threshold, which removes noise from the
sequence.</p>
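      <p>The second step can be sketched as below. The two threshold values are not given in the paper, so <monospace>prob_threshold</monospace> and <monospace>min_length</monospace> are placeholders to be tuned, and the function name <monospace>fear_segments</monospace> is hypothetical. The sketch keeps runs of at least <monospace>min_length</monospace> seconds, which is one possible reading of the length criterion.</p>
      <preformat><![CDATA[
```python
def fear_segments(probabilities, prob_threshold, min_length):
    """Return (start, end) pairs, end exclusive, of predicted fear segments.

    probabilities: one predicted fear probability per second.
    prob_threshold, min_length: the two thresholds mentioned in the text;
    their values are not given in the paper and must be tuned.
    """
    segments = []
    start = None
    for t, p in enumerate(probabilities):
        if p >= prob_threshold:
            if start is None:
                start = t  # a new candidate run begins
        elif start is not None:
            if t - start >= min_length:
                segments.append((start, t))
            start = None
    # Close a run that lasts until the end of the movie.
    if start is not None and len(probabilities) - start >= min_length:
        segments.append((start, len(probabilities)))
    return segments
```
]]></preformat>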
    </sec>
    <sec id="sec-5">
      <title>Fusion methods</title>
      <p>In our experiments, we apply the following fusion methods.</p>
      <p>Early fusion: We concatenate features from different
modalities and different sources into one larger vector. This method is
simple and straightforward, yet sometimes very effective.</p>
      <p>Late fusion: We train several LSTM models simultaneously.
The outputs of the last layers of these LSTM models are merged
and used as the input of a subsequent fully-connected layer.</p>
      <p>Average fusion: To avoid over-fitting and reduce noise, we
compute the average of several models’ predictions.</p>
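      <p>Early fusion and average fusion reduce to simple operations on per-second vectors, as the sketch below illustrates. The function names are hypothetical, and the code assumes that all feature streams and all model outputs are aligned second by second. Late fusion is not shown because it requires the trained networks themselves.</p>
      <preformat><![CDATA[
```python
def early_fusion(*feature_streams):
    """Concatenate per-second vectors from several feature sets."""
    fused = []
    for vectors in zip(*feature_streams):
        row = []
        for vector in vectors:
            row.extend(vector)
        fused.append(row)
    return fused


def average_fusion(*predictions):
    """Element-wise mean of several models' per-second predictions.

    Each prediction is a list of (valence, arousal) pairs.
    """
    n_models = len(predictions)
    return [
        tuple(sum(values) / n_models for values in zip(*per_second))
        for per_second in zip(*predictions)
    ]
```
]]></preformat>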
      <p>In addition, we apply a 25-second triangle filter to the outputs
to reduce their noise.</p>
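      <p>A triangle filter of this kind can be sketched as a weighted moving average whose weights rise linearly to the centre of the window. The function name <monospace>triangle_smooth</monospace> is hypothetical, and the border handling (the window is truncated and its weights renormalised) is an assumption, since the paper does not describe it. For valence and arousal the filter would be applied to each dimension separately.</p>
      <preformat><![CDATA[
```python
def triangle_smooth(values, width=25):
    """Smooth a per-second sequence with a triangular window.

    width: window length in seconds (odd), 25 in the paper.
    Near the borders the window is truncated and its weights renormalised,
    which is an assumption of this sketch.
    """
    half = width // 2
    # Weights rise linearly to the centre and fall symmetrically.
    weights = [half + 1 - abs(k) for k in range(-half, half + 1)]
    smoothed = []
    for t in range(len(values)):
        total = 0.0
        norm = 0.0
        for k, w in zip(range(-half, half + 1), weights):
            if 0 <= t + k < len(values):
                total += w * values[t + k]
                norm += w
        smoothed.append(total / norm)
    return smoothed
```
]]></preformat>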
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>In this section, we describe our experimental settings
and present the results. All hyper-parameters mentioned below, such as
sequence length, hidden size and number of layers, are chosen
on the validation set. The ratio of training to validation data is
4:1.</p>
    </sec>
    <sec id="sec-7">
      <title>Subtask 1</title>
      <p>On the validation set, BLSTM
models perform better than LSTM models, which supports our
assumption. We also find that BLSTM performs best when the
sequence length is 100. As for the features, we tested
multiple early fusion combinations, and the early fusion of Emobase, visual
features (except CNN) and eGeMAPS performs best. We therefore
submitted 5 runs for Subtask 1, all using BLSTM models with a
sequence length of 100 and the same input
features. The first three runs differ only in the number of BLSTM
layers, which is 4, 2 and 3 respectively. Run 4 is the average fusion of
the first three runs. Run 5 is the late fusion of two BLSTM models
whose inputs are Emobase and visual features (except CNN)
respectively. All runs are trained with a dropout probability of 0.5
to avoid over-fitting.</p>
      <p>Table 1 shows that the best run for valence is Run
3, which is a 2-layer BLSTM model using Emobase, visual features
(except CNN) and eGeMAPS as inputs. For arousal, Run 4 achieves
the best MSE, which suggests that average fusion
can improve performance to some extent. Valence
prediction is markedly better than arousal
prediction, probably because arousal is harder to predict than
valence.</p>
    </sec>
    <sec id="sec-8">
      <title>Subtask 2</title>
      <p>For Subtask 2, we use the method described in Section
2.2 (Prediction models). However, it performs much worse than expected. Because the
dataset is imbalanced, the predicted probability of fear
is very low and only a few segments of consecutive seconds are
predicted as "fear". Some movies in the development set even have no
"fear" segments. This suggests that LSTM models may not be well suited to
imbalanced problems. We also tried techniques for
imbalanced data, such as down-sampling movies and giving more
weight to positive samples, but these methods hardly
helped. Owing to time constraints, we did not submit any runs for this
subtask, and we will continue this line of research in future work.</p>
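      <p>Giving more weight to positive samples is commonly done by scaling the loss on the minority class. The sketch below shows one such formulation, a binary cross-entropy with a weight on the fear seconds. The paper does not specify the weighting scheme that was tried, so both functions, their names and the negative-to-positive ratio heuristic are assumptions for illustration.</p>
      <preformat><![CDATA[
```python
import math


def weighted_bce(probabilities, labels, pos_weight):
    """Mean binary cross-entropy with extra weight on positive (fear) seconds."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total -= pos_weight * math.log(p)
        else:
            total -= math.log(1.0 - p)
    return total / len(labels)


def balanced_pos_weight(labels):
    """Ratio of negative to positive seconds, a common heuristic choice."""
    positives = sum(labels)
    return (len(labels) - positives) / positives
```
]]></preformat>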
    </sec>
    <sec id="sec-9">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>In summary, this year we further studied the Emotional Impact
of Movies task and gained some useful insights. First,
temporal models such as LSTM and BLSTM can capture more
information in time-sequential problems when given enough training
data. BLSTM models could be less affected by the latency and
mismatch between annotations and movies, and they perform better
than unidirectional LSTM models. As for fusion methods, early fusion
and average fusion are both simple and intuitive, yet they usually
perform well.</p>
      <p>Still, some problems remain to be solved. SentiBank features are
not as useful as expected in this task. More CNN-related
features should be extracted and tested. Arousal is much harder to
predict than valence in our experiments, which needs further
investigation. For Subtask 2, the problem of the imbalanced dataset
remains unsolved this year, even though the evaluation metric has
been changed to intersection over union. In addition, techniques
from other domains, such as object segmentation and
voice activity detection, could be applied to this subtask to
handle the new metric. Moreover, adding more fear-related movies to
the dataset could be another effective way to alleviate the
imbalance problem.</p>
      <p>In conclusion, this paper describes our approach to the
MediaEval 2018 Emotional Impact of Movies task. We trained
BLSTM models using multi-modal features and several fusion
methods, which achieve promising performance in the valence and
arousal prediction task. The fear prediction task is not fully solved and
remains to be investigated further.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by the National Natural Science
Foundation of China (61433018, 61171116) and the National High
Technology Research and Development Program of China (863
program) (2015AA016305).</p>
      <sec id="sec-10-1">
        <title>Emotional Impact of Movies Task MediaEval’18, 29-31 October 2018, Sophia Antipolis, France</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          , Liming Chen, Yoann Baveye, Mats Sjöberg, and
          <string-name>
            <given-names>Christel</given-names>
            <surname>Chamaret</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The MediaEval 2016 Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2016 Workshop</source>
          . Hilversum, Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          , Martijn Huigsloot, Liming Chen, Yoann Baveye, and
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The MediaEval 2017 Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2017 Workshop</source>
          . Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          , Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The MediaEval 2018 Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2018 Workshop</source>
          . Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          , Klaus Scherer, Khiet Truong, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, and
          <string-name>
            <given-names>Petri</given-names>
            <surname>Laukka</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>7</volume>
          ,
          <issue>2</issue>
          (
          <year>2016</year>
          ),
          <fpage>190</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          , Felix Weninger, Florian Gross, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Multimedia. ACM</source>
          ,
          <fpage>835</fpage>
          -
          <lpage>838</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Zitong</given-names>
            <surname>Jin</surname>
          </string-name>
          , Yuqi Yao, Ye Ma, and
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>THUHCSI in MediaEval 2017 Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2017 Workshop</source>
          . Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Brendan</given-names>
            <surname>Jou</surname>
          </string-name>
          , Tao Chen, Nikolaos Pappas, Miriam Redi, Mercan Topkara, and
          <string-name>
            <given-names>Shih-Fu</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Visual affect around the world: A large-scale multilingual visual sentiment ontology</article-title>
          .
          <source>In Proceedings of the 23rd ACM international conference on Multimedia. ACM</source>
          ,
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ye</given-names>
            <surname>Ma</surname>
          </string-name>
          , Zipeng Ye, and
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>THU-HCSI at MediaEval 2016: Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2016 Workshop</source>
          . Hilversum, Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          , Yoann Baveye, Hanli Wang, Vu Lam Quang, Bogdan Ionescu, Emmanuel Dellandréa, Markus Schedl, Claire-Hélène Demarty, and
          <string-name>
            <given-names>Liming</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The MediaEval 2015 Affective Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2015 Workshop</source>
          . Wurzen, Germany.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>