=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_26
|storemode=property
|title=THUHCSI in MediaEval 2017 Emotional Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_26.pdf
|volume=Vol-1984
|authors=Zitong Jin,Yuqi Yao,Ye Ma,Mingxing Xu
|dblpUrl=https://dblp.org/rec/conf/mediaeval/JinYMX17
}}
==THUHCSI in MediaEval 2017 Emotional Impact of Movies Task==
Zitong Jin, Yuqi Yao, Ye Ma, Mingxing Xu
Key Laboratory of Pervasive Computing, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology, Tsinghua University, Beijing, China
{jzt15,yaoyq15,ma-y17}@mails.tsinghua.edu.cn,xumx@tsinghua.edu.cn
ABSTRACT
In this paper we describe our team's approach to the MediaEval 2017 Emotional Impact of Movies challenge. In addition to the baseline features, we use the openSMILE toolkit to extract the eGeMAPS audio feature set from the video clips. We also target the continuous flow of emotion, for which time-sequential models such as LSTMs are useful and effective. Fusion methods are also considered and discussed in this paper. The evaluation results of our experiments show that our features and models are competitive in both valence / arousal and fear prediction, indicating the effectiveness of our approaches.

1 INTRODUCTION
The MediaEval 2017 Emotional Impact of Movies challenge consists of two subtasks. Subtask 1 aims at valence / arousal prediction, while Subtask 2 aims at fear prediction. Long movies are considered in both cases, and a prediction must be given every 5 seconds for each consecutive ten-second segment. The LIRIS-ACCEDE dataset [1, 2] is used for training and testing, including both its discrete and continuous parts. For more details, please refer to [5].

Video affective analysis and prediction is an important and challenging problem that has drawn the attention of many researchers in recent years. The Emotional Impact of Movies task has now been held for three years, and many participants took part in the 2015 and 2016 editions of the challenge [4, 9].
2 APPROACH
In this section, we describe our main approaches to the two subtasks, covering feature extraction, pre-processing, prediction models, fusion, and post-processing.

2.1 Subtask 1: valence / arousal prediction
Feature extraction. In addition to the baseline features provided by the organizers, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [6], which contains 88 features, is extracted from the audio channel; it proved effective in the same task last year [8]. In our experiments, we extract these features with the openSMILE toolkit [7] from 5-second-long segments cut from the original videos in advance.
As for the visual features, the general-purpose visual features provided by the organizers (except the CNN features) are concatenated into one large feature vector. This is mainly because these features are low-dimensional and complementary, and combining them greatly reduces the training workload compared with trying each of them separately.

All input features are scaled to zero mean and unit variance for normalization.
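As a minimal sketch of this pre-processing step (the paper does not name a scaling implementation; scikit-learn's StandardScaler is assumed here), the scaler statistics are estimated on the training data and re-used unchanged for the test data:

```python
# Sketch of the normalization step, assuming scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler

def normalize(train_feats: np.ndarray, test_feats: np.ndarray):
    """Scale every feature dimension to zero mean and unit variance."""
    scaler = StandardScaler().fit(train_feats)  # statistics from training data only
    return scaler.transform(train_feats), scaler.transform(test_feats)
```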
Prediction models. Two kinds of models are adopted in our experiments: traditional machine learning models and time-sequential models. Specifically, the traditional models are Support Vector Regression (SVR) and AdaBoost, while the time-sequential ones are Long Short-Term Memory (LSTM) models. The LSTM models may capture the emotional flow of a video and improve performance. We treat the problem as sequence-to-one regression: the input features of the LSTM models are segmented with a 10-second-long sliding window with 5 seconds of overlap. All models are trained separately for valence and arousal.
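The window segmentation for this sequence-to-one setting can be sketched as follows; the frame rate of the features is not specified in the paper, so the window and hop lengths in frames are illustrative assumptions:

```python
# Sketch of the sliding-window segmentation. With one feature vector per
# second (an assumption), a 10-second window with a 5-second hop yields
# 5 seconds of overlap between consecutive windows.
import numpy as np

def make_windows(features: np.ndarray, win: int = 10, hop: int = 5) -> np.ndarray:
    """Cut a (num_frames, feat_dim) sequence into overlapping windows.

    Returns (num_windows, win, feat_dim); each window is one LSTM input
    sequence whose single regression target is that segment's label.
    """
    starts = range(0, len(features) - win + 1, hop)
    return np.stack([features[s:s + win] for s in starts])
```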
Fusion methods. To combine features of different modalities, besides the early fusion method, which simply concatenates the different features, late fusion is also considered. For the traditional prediction models, average fusion is used to avoid overfitting. For the LSTM models, the hidden vectors of several LSTM models taking different inputs are fused by a one-layer fully connected network to obtain the final prediction; this fusion network is trained together with the LSTM models.
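A minimal PyTorch sketch of this LSTM late-fusion scheme, assuming one LSTM per feature stream and illustrative layer sizes rather than the paper's exact configuration:

```python
# One LSTM per feature stream; final hidden vectors are concatenated and
# mapped to one output by a one-layer fully connected network, and the
# whole model is trained jointly. Sizes are illustrative.
import torch
import torch.nn as nn

class LateFusionLSTM(nn.Module):
    def __init__(self, input_dims, hidden_size=500):
        super().__init__()
        # One LSTM per input feature set (e.g., eGeMAPS, VGG, ...).
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, hidden_size, batch_first=True) for d in input_dims])
        # One-layer FC fusion to a single valence/arousal value.
        self.fc = nn.Linear(hidden_size * len(input_dims), 1)

    def forward(self, inputs):
        # inputs: list of (batch, seq_len, dim_i) tensors, one per stream.
        hidden = []
        for lstm, x in zip(self.lstms, inputs):
            _, (h_n, _) = lstm(x)   # h_n: (num_layers, batch, hidden)
            hidden.append(h_n[-1])  # last layer's final hidden vector
        return self.fc(torch.cat(hidden, dim=1)).squeeze(-1)
```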
After fusion, to reduce fluctuations in the output and smooth out random noise, a 25-frame-long triangle filter is applied to the predictions of each video.
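A minimal sketch of this post-processing step, assuming NumPy's Bartlett window as the triangle filter (the paper does not specify its implementation):

```python
# 25-frame triangular smoothing applied per video to the prediction sequence.
import numpy as np

def smooth(preds: np.ndarray, width: int = 25) -> np.ndarray:
    """Smooth a 1-D prediction sequence with a normalized triangle filter."""
    window = np.bartlett(width)
    window /= window.sum()                           # unit-gain filter
    return np.convolve(preds, window, mode="same")   # aligned with input
```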
2.2 Subtask 2: fear prediction
Feature extraction. We use the same feature sets as in Subtask 1. However, the main problem and the biggest challenge of Subtask 2 is that the samples are so unbalanced that simply predicting “zero” everywhere already achieves an accuracy of 84.34% on the test set (see Run 4). Therefore, to alleviate this imbalance, the SMOTE (Synthetic Minority Over-sampling TEchnique) [3] method is applied after feature extraction to re-sample the data. The main idea of SMOTE is to generate new samples for the minority class by interpolation, which makes the data more balanced.

Prediction models. A Random Forest model is adopted for fear prediction, as it may behave better than a Support Vector Machine (SVM) on unbalanced problems. We first use the Random Forest model to obtain, for each video clip, the probability of the fear class (“one”). We then set a decision threshold p and predict fear when the probability is larger than p. The value of p is tuned on the validation set. Due to time constraints, we did not try LSTM models for Subtask 2.

Fusion methods. As in Subtask 1, both early and late fusion are used. In late fusion, the probabilities output by the different models are averaged to obtain a single probability.
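The Subtask 2 pipeline can be sketched as follows, assuming the imbalanced-learn implementation of SMOTE and scikit-learn's Random Forest; hyper-parameter values are placeholders, not the paper's settings:

```python
# SMOTE over-sampling of the minority (fear) class, a Random Forest, and
# a decision threshold p tuned on the validation set.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

def train_fear_model(X_train, y_train):
    X_res, y_res = SMOTE().fit_resample(X_train, y_train)  # re-balance classes
    return RandomForestClassifier(n_estimators=100).fit(X_res, y_res)

def predict_fear(clf, X, p=0.5):
    prob_fear = clf.predict_proba(X)[:, 1]          # probability of class "one"
    return (prob_fear > p).astype(int), prob_fear   # hard labels and raw scores
```

In late fusion, the prob_fear outputs of the different models would be averaged before the threshold is applied.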
Table 1: Results of Subtask 1 on test set

        Valence             Arousal
Runs    MSE      r          MSE      r
Run 1   0.2230   -0.0985    0.1577    0.2261
Run 2   0.1670   -0.0990    0.1269   -0.0122
Run 3   0.1833    0.3707    0.1166    0.3213
Run 4   0.2074   -0.0111    0.1318    0.2708
Run 5   0.2046    0.0122    0.1300    0.2750

Table 2: Results of Subtask 2 on test set

Runs    Accuracy   Precision   Recall   F1
Run 1   0.7352     0.0206      0.0530   0.0239
Run 2   0.8153     0.2318      0.2781   0.2352
Run 3   0.8461     0.2035      0.0208   0.0371
Run 4   0.8434     0.0000      0.0000   0.0000
Run 5   0.8469     0.2383      0.2186   0.2165
3 EXPERIMENTS AND RESULTS
In this section, we describe our submitted runs in more detail and present the results. Note that all hyper-parameters were selected according to the results on the validation set, with a 4:1 ratio of training data to validation data.
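A minimal sketch of such a split (scikit-learn is assumed; the paper does not state how the split was implemented):

```python
# Illustrative 4:1 train/validation split of the development data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 88)  # placeholder: 100 segments of 88-d eGeMAPS features
y = np.random.rand(100)      # placeholder: valence (or arousal) annotations
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 4:1 train:validation ratio
```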
3.1 Subtask 1: valence / arousal prediction
We submitted 5 runs for valence / arousal prediction, where the first two use LSTM models and the others use SVR and AdaBoost:
Run 1: For valence, a 2-layer LSTM of hidden size 500 taking eGeMAPS as input; for arousal, a 3-layer LSTM of hidden size 500 taking VGG as input.
Run 2: For valence, late fusion of three 2-layer LSTMs of hidden size 1000 taking eGeMAPS, VGG, and the other visual features as input, respectively; for arousal, the input features are Emobase, eGeMAPS, and CEDD, respectively.
Run 3: For both valence and arousal, an SVR model taking VGG as input.
Run 4: For valence, an AdaBoost model taking eGeMAPS as input; for arousal, an AdaBoost model taking the other visual features as input.
Run 5: For both valence and arousal, late fusion of Run 3 and Run 4.
Here, the “other visual features” in Runs 2 and 4 denote the concatenation of all the provided visual features except the CNN feature. CEDD is the Color and Edge Directivity Descriptor, one of the provided visual features. VGG denotes the CNN features extracted from the fc6 layer of VGG16.
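A minimal sketch of how such fc6 features could be extracted, assuming torchvision's VGG16 (where classifier[0] is the fc6 layer); this is one reading of the description, not the authors' extraction code:

```python
# fc6 activations of torchvision's VGG16 as per-frame features.
# "frame_image" is assumed to be a PIL image of one video frame.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(pretrained=True).eval()
fc6 = nn.Sequential(
    vgg.features,       # convolutional backbone
    vgg.avgpool,        # pool to 7x7
    nn.Flatten(),       # 512 * 7 * 7 = 25088
    vgg.classifier[0],  # fc6: Linear(25088, 4096)
)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    feat = fc6(preprocess(frame_image).unsqueeze(0))  # shape (1, 4096)
```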
From Table 1 we can see that the best run for valence MSE is Run 2, which uses late fusion of LSTM models, while Run 3, which uses an SVR model with the VGG feature, achieves the best results on the other metrics. Notice that Run 2, the LSTM late-fusion method, obtains a better MSE than Run 1, the single LSTM model, which suggests that late fusion of three models exploits the different information in the different features and improves performance to some extent. However, the LSTM models perform worse in Pearson's r than the traditional machine learning models. This could be because the LSTM models tend to predict similar values at all times, which yields a lower MSE but also a lower Pearson's r.
Taken together, Run 3 using SVR and VGG achieves the best results, which suggests that CNN features contain useful information for emotion analysis and that traditional models can behave well when trained properly.

3.2 Subtask 2: fear prediction
We submitted 5 runs for fear prediction, all using Random Forest models:
Run 1: Random Forest + other visual features.
Run 2: Random Forest + VGG.
Run 3: Random Forest + all visual features.
Run 4: Predicting “zero” everywhere (as a reference baseline).
Run 5: Late fusion of Run 1 and Run 2.

From Table 2 we can see that Run 2, using VGG features, achieves the best results on recall and F1, while Run 5, using late fusion, achieves the best results on accuracy and precision. As mentioned before, Subtask 2 is very unbalanced, with far fewer fear samples. It is therefore no surprise that accuracy and precision pattern together while recall and F1 form the other pair: predicting more “zeros” leads to higher accuracy but lower recall, and vice versa.

When considering the F1 score, the harmonic mean of precision and recall, Run 2 using the VGG feature performs best, which is consistent with the finding in Subtask 1 that CNN features contain useful information for emotion analysis.

4 CONCLUSION AND DISCUSSION
In this paper, we describe our approach to the MediaEval 2017 “Emotional Impact of Movies” task. In the valence / arousal prediction subtask, both LSTM and SVR models are trained and compared. In the fear prediction subtask, Random Forest models using different features are compared. In addition, early fusion and late fusion are adopted in the experiments and show promising results in some respects.

However, some problems remain unsolved. For instance, some of the LSTM models tend to predict similar values at all times, leading to a very low Pearson's r, which may be caused by an inappropriate experimental configuration. The imbalance problem in Subtask 2 persists even with the SMOTE algorithm, which means that changing models or features may make little difference, and predicting “zero” everywhere can still achieve a very high accuracy. These problems remain to be addressed in future work.

ACKNOWLEDGMENTS
This work was partially supported by the National High Technology Research and Development Program of China (863 Program) (2015AA016305) and the National Natural Science Foundation of China (61433018, 61171116).
REFERENCES
[1] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. 2015. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 77–83.
[2] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. 2015. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55.
[3] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[4] Emmanuel Dellandréa, Liming Chen, Yoann Baveye, Mats Sjöberg, and Christel Chamaret. 2016. The MediaEval 2016 Emotional Impact of Movies Task. In Proceedings of MediaEval 2016 Workshop. Hilversum, Netherlands.
[5] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Sjöberg. 2017. The MediaEval 2017 Emotional Impact of Movies Task. In Proceedings of MediaEval 2017 Workshop. Dublin, Ireland.
[6] Florian Eyben, Klaus Scherer, Khiet Truong, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, and Petri Laukka. 2016. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing 7, 2 (2016), 190–202.
[7] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 835–838.
[8] Ye Ma, Zipeng Ye, and Mingxing Xu. 2016. THU-HCSI at MediaEval 2016: Emotional Impact of Movies Task. In Proceedings of MediaEval 2016 Workshop. Hilversum, Netherlands.
[9] Mats Sjöberg, Yoann Baveye, Hanli Wang, Vu Lam Quang, Bogdan Ionescu, Emmanuel Dellandréa, Markus Schedl, Claire-Hélène Demarty, and Liming Chen. 2015. The MediaEval 2015 Affective Impact of Movies Task. In Proceedings of MediaEval 2015 Workshop. Wurzen, Germany.