=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_22
|storemode=property
|title=THUHCSI in MediaEval 2018 Emotional Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_22.pdf
|volume=Vol-2283
|authors=Ye Ma,Xihao Liang,Mingxing Xu
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MaLX18
}}
==THUHCSI in MediaEval 2018 Emotional Impact of Movies Task==
<pdf width="1500px">https://ceur-ws.org/Vol-2283/MediaEval_18_paper_22.pdf</pdf>
<pre>
THUHCSI in MediaEval 2018 Emotional Impact of Movies Task
Ye Ma¹, Xihao Liang¹, Mingxing Xu¹
¹ Department of Computer Science & Technology, Tsinghua University, Beijing, China

ma-y17@mails.tsinghua.edu.cn, liangxh16@mails.tsinghua.edu.cn, xumx@tsinghua.edu.cn

Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
In this paper we describe our team’s approach to the MediaEval 2018 Challenge Emotional
Impact of Movies. We extract several sets of audio and visual features, and then apply
time-sequential models such as LSTM and BLSTM to model the continuous flow of emotion in
movies. Different fusion methods are also considered and discussed. The results show that
our methods achieve promising performance, indicating the effectiveness of the features
and models we choose.

1     INTRODUCTION
The Challenge Emotional Impact of Movies of MediaEval has been held since 2015 [1, 2, 9].
This challenge mainly focuses on the emotions evoked by movies and how to predict them.
This year’s task consists of two subtasks. Subtask 1 aims at Valence / Arousal prediction
and Subtask 2 aims at Fear prediction. Details of both subtasks can be found in [3].

2     APPROACH
In this section, we describe in detail our team’s main approach, including feature
extraction, prediction models, fusion methods, pre-processing and post-processing.

2.1     Feature extraction
   Audio features. Previous results [6, 8] have shown the great potential of the extended
Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [4]. This feature set contains 23
low-level descriptors (LLDs) and has proved effective in acoustic tasks such as speech
emotion recognition. In our experiments, we extract the low-level descriptors of eGeMAPS
using the OpenSMILE toolbox [5]. Then we compute the mean and standard deviation of all 23
descriptors in a centered 5-second sliding window to obtain a 46-dimensional feature
vector for each second of the movie clip.
   In addition, we consider the baseline audio features provided by the organizers, namely
the Emobase 2010 feature set (1582 dimensions).

   Visual features. The baseline features consist of multiple general-purpose visual
features. Following last year’s experiments, we concatenate all visual features except the
CNN feature into one large feature vector of 1271 dimensions. The CNN feature is treated
separately from the other features because it is much larger (4096 dimensions) and comes
from a different source.
   In order to utilize more visual information, we try using SentiBank for feature
extraction. We apply the MVSO detectors [7] to image frames extracted from the movies once
per second and take the output of the final layer of the Inception network, which can be
regarded as the composition ratio of different concepts (4342 dimensions).
   All features are normalized to zero mean and unit variance.

2.2    Prediction models
Last year’s results [6] showed that Support Vector Machines (SVM) performed better than
Long Short-Term Memory models (LSTM). However, as the training dataset is larger than last
year’s and time-sequential models are expected to perform better on larger datasets, this
year we adopt LSTM as the prediction model to predict the emotional flow. In detail, we
treat the task as a sequence-to-sequence problem, and the length of the input sequences is
chosen on the validation set.
   This year, we also use the Bidirectional LSTM (BLSTM), mainly for two reasons. First,
the ground truth of emotion is labelled while the annotators are watching the movies, so
the latency and mismatch between the ground truth and the movie content must be
considered. Second, the emotional flow in movies changes smoothly, and a Bidirectional LSTM
may be less affected by fluctuations in the input features.
   Another difference from last year is that we train models for valence and arousal
together. Considering that valence and arousal share a similar underlying emotional
concept, it is reasonable to use the same underlying structure. Therefore, every
regression model is trained to predict a two-dimensional vector which represents both
valence and arousal.
   As for Subtask 2, the experiments are done in two steps for simplicity. First, we train
a classification model to predict the label for every second. Second, we identify a
segment as "Fear" according to the labels of the seconds within it. Specifically, we filter
out the seconds whose probability of evoking fear is lower than a threshold we set, and
only keep the sequences whose length exceeds a certain threshold, which removes noise from
the sequence.

2.3    Fusion methods
In our experiments, we apply multiple fusion methods, which are described as follows.
   Early fusion: We concatenate features from different modalities and different sources
into one larger vector. This method is simple and straightforward, yet sometimes very
effective.
   Late fusion: We train several LSTM models simultaneously. The outputs of the last
layers of these LSTM models are merged and used as the input to a subsequent
fully-connected layer.
   Average fusion: To avoid over-fitting and reduce noise, we compute the average of
several models’ predictions.
   In addition, we apply a triangle filter of 25 seconds to reduce the noise of the
outputs.
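
   The average fusion and the 25-second triangle filter described in Section 2.3 can be
illustrated with the following minimal Python sketch. It is an illustration only and not
the implementation used for the submitted runs: the function names, the use of NumPy, the
normalization of the kernel to unit sum and the edge padding at sequence boundaries are
all assumptions, since the text specifies only the filter shape and its length.

```python
import numpy as np


def average_fusion(predictions):
    """Average the per-second predictions of several models.

    `predictions` is a list of arrays that all have the same shape
    (hypothetical input format, chosen for this sketch).
    """
    return np.mean(np.stack(predictions, axis=0), axis=0)


def triangle_smooth(pred, width=25):
    """Smooth a 1-D per-second sequence with a normalized triangular kernel."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the output keeps the input length")
    pred = np.asarray(pred, dtype=float)
    # np.bartlett(width + 2) is zero at both ends; dropping those two
    # samples leaves exactly `width` nonzero triangular taps.
    kernel = np.bartlett(width + 2)[1:-1]
    kernel = kernel / kernel.sum()  # assumption: unit-sum kernel
    half = width // 2
    # Assumption: repeat the first and last values at the boundaries.
    padded = np.pad(pred, half, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


# Usage with synthetic stand-ins for three models' per-second outputs.
rng = np.random.default_rng(0)
runs = [rng.normal(size=300) for _ in range(3)]
smoothed = triangle_smooth(average_fusion(runs))
```

   With an odd width and edge padding, the smoothed sequence has one value per second,
exactly like the input, so it can be compared directly against the per-second ground
truth.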

Table 1: Results of Subtask 1 on test set

               Valence               Arousal
   Runs      MSE       r          MSE       r
   Run 1     0.1021    0.1714     0.1414    0.0870
   Run 2     0.1036    0.1820     0.1399   -0.0181
   Run 3     0.0924    0.3048     0.1399    0.0761
   Run 4     0.0980    0.2422     0.1396    0.0612
   Run 5     0.0944    0.2511     0.1460   -0.0667

3     EXPERIMENTS AND RESULTS
In this section, we describe our experimental settings in detail and present the results.
Note that all hyper-parameters below, such as sequence length, hidden size and number of
layers, are determined on the validation set. The ratio of training to validation data is
4:1.

3.1    Subtask 1
Our experiments on the validation set show that BLSTM models perform better than LSTM
models, which supports our assumption. We also find that BLSTM performs best when the
sequence length is 100. As for the features, we have tested multiple early fusion
combinations, and the early fusion of Emobase, visual features (except CNN) and eGeMAPS
performs best. Thus, we have submitted 5 runs for Subtask 1, all using BLSTM models with a
sequence length of 100 and the same input features. The first three runs only differ in
the number of BLSTM layers, which is 4, 2 and 3 respectively. Run 4 is the average fusion
of the first three runs. Run 5 is the late fusion of two BLSTM models, of which the inputs
are Emobase and visual features (except CNN) respectively. All runs are trained using a
dropout probability of 0.5 to avoid over-fitting.
   From Table 1 we can see that the best run for valence is Run 3, which is a 2-layer
BLSTM model using Emobase, visual features (except CNN) and eGeMAPS as inputs. As for
arousal, Run 4 achieves the best performance in MSE, which indicates that average fusion
sometimes improves performance to some extent. The results for valence prediction are
markedly better than those for arousal prediction, probably because arousal is harder to
predict than valence.

3.2    Subtask 2
As for Subtask 2, we try to use the method discussed in Section 2.2. However, it performs
much worse than expected. Due to the imbalanced dataset, the predicted probability of fear
is very low and only a few segments of consecutive seconds are predicted as "fear". Some
movies in the development set even have no "fear" segments. This suggests that LSTM models
may not be well suited to imbalanced problems. We have also tried techniques for imbalanced
problems, such as down-sampling movies and giving more weight to positive samples.
Nevertheless, these methods hardly help. Owing to time constraints, we did not submit runs
for this subtask, and we will continue this research in future work.

4     DISCUSSION AND OUTLOOK
In summary, this year we have further studied the Emotional Impact of Movies task and
gained some useful insights. Firstly, temporal models such as LSTM and BLSTM can capture
more information in time-sequential problems when given enough training data. BLSTM models
may also be less affected by the latency and mismatch between annotations and movies, and
they perform better than unidirectional LSTM models. As for fusion methods, early fusion
and average fusion are both simple and intuitive, yet they usually perform well.
   Still, some problems remain to be solved. SentiBank features are not as useful as
expected in this task. More CNN-related features should be extracted and tested. Arousal
is much harder to predict than valence in our experiments, which needs further
investigation. For Subtask 2, the problem of the imbalanced dataset remains unsolved this
year, even though the evaluation metric has been changed to intersection over union. In
addition, some novel techniques from other domains, such as object segmentation and voice
activity detection, could be applied to this subtask to handle the new metric. Moreover,
adding more fear-related movies to the dataset could be another effective approach to
alleviate the imbalance problem.
   In conclusion, this paper illustrates our approach to the MediaEval 2018 Challenge
Emotional Impact of Movies task. We have trained BLSTM models using multi-modality
features and several fusion methods, which achieve promising performance in the valence
and arousal prediction task. The fear prediction task is not fully solved and remains to
be further investigated.

ACKNOWLEDGMENTS
This work was partially supported by the National Natural Science Foundation of China
(61433018, 61171116) and the National High Technology Research and Development Program of
China (863 program) (2015AA016305).

REFERENCES
[1] Emmanuel Dellandréa, Liming Chen, Yoann Baveye, Mats Sjöberg, and Christel Chamaret.
    2016. The MediaEval 2016 Emotional Impact of Movies Task. In Proceedings of MediaEval
    2016 Workshop. Hilversum, Netherlands.
[2] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Sjöberg.
    2017. The MediaEval 2017 Emotional Impact of Movies Task. In Proceedings of MediaEval
    2017 Workshop. Dublin, Ireland.
[3] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and
    Mats Sjöberg. 2018. The MediaEval 2018 Emotional Impact of Movies Task. In Proceedings
    of MediaEval 2018 Workshop. Sophia Antipolis, France.
[4] Florian Eyben, Klaus Scherer, Khiet Truong, Björn Schuller, Johan Sundberg, Elisabeth
    André, Carlos Busso, Laurence Devillers, Julien Epps, and Petri Laukka. 2016. The
    Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective
    Computing. IEEE Transactions on Affective Computing 7, 2 (2016), 190–202.
[5] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent
    developments in openSMILE, the munich open-source multimedia feature extractor. In
    Proceedings of the 21st ACM international conference on Multimedia. ACM, 835–838.
[6] Zitong Jin, Yuqi Yao, Ye Ma, and Mingxing Xu. 2017. THUHCSI in MediaEval 2017
    Emotional Impact of Movies Task. In Proceedings of MediaEval 2017 Workshop. Dublin,
    Ireland.
[7] Brendan Jou, Tao Chen, Nikolaos Pappas, Miriam Redi, Mercan Topkara, and Shih-Fu
    Chang. 2015. Visual affect around the world: A large-scale multilingual visual
    sentiment ontology. In Proceedings of the 23rd ACM international conference on
    Multimedia. ACM, 159–168.
[8] Ye Ma, Zipeng Ye, and Mingxing Xu. 2016. THU-HCSI at MediaEval 2016: Emotional Impact
    of Movies Task. In Proceedings of MediaEval 2016 Workshop. Hilversum, Netherlands.
[9] Mats Sjöberg, Yoann Baveye, Hanli Wang, Vu Lam Quang, Bogdan Ionescu, Emmanuel
    Dellandréa, Markus Schedl, Claire-Hélène Demarty, and Liming Chen. 2015. The MediaEval
    2015 Affective Impact of Movies Task. In Proceedings of MediaEval 2015 Workshop.
    Wurzen, Germany.

</pre>