=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_26
|storemode=property
|title=THUHCSI in MediaEval 2017 Emotional Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_26.pdf
|volume=Vol-1984
|authors=Zitong Jin,Yuqi Yao,Ye Ma,Mingxing Xu
|dblpUrl=https://dblp.org/rec/conf/mediaeval/JinYMX17
}}
==THUHCSI in MediaEval 2017 Emotional Impact of Movies Task==
Zitong Jin, Yuqi Yao, Ye Ma, Mingxing Xu
Key Laboratory of Pervasive Computing, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology, Tsinghua University, Beijing, China
{jzt15,yaoyq15,ma-y17}@mails.tsinghua.edu.cn,xumx@tsinghua.edu.cn
ABSTRACT
In this paper we describe our team's approach to the MediaEval 2017 Emotional Impact of Movies challenge. In addition to the baseline features, we use the openSMILE toolkit to extract the eGeMAPS audio feature set from the video clips. We also target the continuous flow of emotion, for which time-sequential models such as LSTMs are useful and effective. Fusion methods are also considered and discussed in this paper. The evaluation results of our experiments show that our features and models are competitive in both valence / arousal and fear prediction, indicating the effectiveness of our approaches.

1 INTRODUCTION
The MediaEval 2017 Emotional Impact of Movies challenge consists of two subtasks. Subtask 1 aims at valence / arousal prediction, while Subtask 2 aims at fear prediction. Long movies are considered in both cases, and a prediction must be given every 5 seconds for each consecutive ten-second segment. The LIRIS-ACCEDE dataset [1, 2] is used for training and testing, including both its discrete and continuous parts. For more details, please refer to [5].

Video affective analysis and prediction is an important and challenging problem that has drawn the attention of many researchers in recent years. The Emotional Impact of Movies task has now been held for three years, and many participants took part in the 2015 and 2016 editions of the challenge [4, 9].
2 APPROACH
In this section, we describe our main approaches to the two subtasks, covering feature extraction, pre-processing, prediction models, fusion, and post-processing.

2.1 Subtask 1: valence / arousal prediction
Feature extraction. In addition to the baseline features provided by the organizers, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [6], which contains 88 features, is extracted from the audio channel; it proved effective in the same task last year [8]. In our experiments, we extract these features with the openSMILE toolkit [7] from 5-second-long segments cut from the original videos in advance.
As for the visual features, the general-purpose visual features provided by the organizers (except the CNN features) are concatenated into one large feature vector. This is mainly because these features are low-dimensional and complementary, and combining them greatly reduces the training workload compared with trying each of them separately.

All input features are scaled to zero mean and unit variance for normalization.
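As a minimal sketch of this pre-processing step (the paper does not name a scaling implementation; scikit-learn's StandardScaler is assumed here), the scaler statistics are estimated on the training data and re-used unchanged for the test data:

```python
# Sketch of the normalization step, assuming scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler

def normalize(train_feats: np.ndarray, test_feats: np.ndarray):
    """Scale every feature dimension to zero mean and unit variance."""
    scaler = StandardScaler().fit(train_feats)  # statistics from training data only
    return scaler.transform(train_feats), scaler.transform(test_feats)
```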
Prediction models. Two kinds of models are adopted in our experiments: traditional machine learning models and time-sequential models. Specifically, the traditional models are Support Vector Regression (SVR) and AdaBoost, while the time-sequential ones are Long Short-Term Memory (LSTM) models. The LSTM models may capture the emotional flow of a video and improve performance. We treat the problem as sequence-to-one regression: the input features of the LSTM models are segmented with a 10-second-long sliding window with 5 seconds of overlap. All models are trained separately for valence and arousal.
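The window segmentation for this sequence-to-one setting can be sketched as follows; the frame rate of the features is not specified in the paper, so the window and hop lengths in frames are illustrative assumptions:

```python
# Sketch of the sliding-window segmentation. With one feature vector per
# second (an assumption), a 10-second window with a 5-second hop yields
# 5 seconds of overlap between consecutive windows.
import numpy as np

def make_windows(features: np.ndarray, win: int = 10, hop: int = 5) -> np.ndarray:
    """Cut a (num_frames, feat_dim) sequence into overlapping windows.

    Returns (num_windows, win, feat_dim); each window is one LSTM input
    sequence whose single regression target is that segment's label.
    """
    starts = range(0, len(features) - win + 1, hop)
    return np.stack([features[s:s + win] for s in starts])
```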
Fusion methods. To combine features of different modalities, besides the early fusion method, which simply concatenates the different features, late fusion is also considered. For the traditional prediction models, average fusion is used to avoid overfitting. For the LSTM models, the hidden vectors of several LSTM models taking different inputs are fused by a one-layer fully connected network to obtain the final prediction; this fusion network is trained together with the LSTM models.
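A minimal PyTorch sketch of this LSTM late-fusion scheme, assuming one LSTM per feature stream and illustrative layer sizes rather than the paper's exact configuration:

```python
# One LSTM per feature stream; final hidden vectors are concatenated and
# mapped to one output by a one-layer fully connected network, and the
# whole model is trained jointly. Sizes are illustrative.
import torch
import torch.nn as nn

class LateFusionLSTM(nn.Module):
    def __init__(self, input_dims, hidden_size=500):
        super().__init__()
        # One LSTM per input feature set (e.g., eGeMAPS, VGG, ...).
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, hidden_size, batch_first=True) for d in input_dims])
        # One-layer FC fusion to a single valence/arousal value.
        self.fc = nn.Linear(hidden_size * len(input_dims), 1)

    def forward(self, inputs):
        # inputs: list of (batch, seq_len, dim_i) tensors, one per stream.
        hidden = []
        for lstm, x in zip(self.lstms, inputs):
            _, (h_n, _) = lstm(x)   # h_n: (num_layers, batch, hidden)
            hidden.append(h_n[-1])  # last layer's final hidden vector
        return self.fc(torch.cat(hidden, dim=1)).squeeze(-1)
```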
After fusion, to reduce fluctuations in the output and smooth out random noise, a 25-frame-long triangle filter is applied to the predictions of each video.
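A minimal sketch of this post-processing step, assuming NumPy's Bartlett window as the triangle filter (the paper does not specify its implementation):

```python
# 25-frame triangular smoothing applied per video to the prediction sequence.
import numpy as np

def smooth(preds: np.ndarray, width: int = 25) -> np.ndarray:
    """Smooth a 1-D prediction sequence with a normalized triangle filter."""
    window = np.bartlett(width)
    window /= window.sum()                           # unit-gain filter
    return np.convolve(preds, window, mode="same")   # aligned with input
```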
2.2 Subtask 2: fear prediction
Feature extraction. We use the same feature sets as in Subtask 1. However, the main problem and the biggest challenge of Subtask 2 is that the samples are so unbalanced that simply predicting “zero” everywhere already achieves an accuracy of 84.34% on the test set (see Run 4). Therefore, to alleviate this imbalance, the SMOTE (Synthetic Minority Over-sampling TEchnique) [3] method is applied after feature extraction to re-sample the data. The main idea of SMOTE is to generate new samples for the minority class by interpolation, which makes the data more balanced.

Prediction models. A Random Forest model is adopted for fear prediction, as it may behave better than a Support Vector Machine (SVM) on unbalanced problems. We first use the Random Forest model to obtain, for each video clip, the probability of the fear class (“one”). We then set a decision threshold p and predict fear when the probability is larger than p. The value of p is tuned on the validation set. Due to time constraints, we did not try LSTM models for Subtask 2.

Fusion methods. As in Subtask 1, both early and late fusion are used. In late fusion, the probabilities output by the different models are averaged to obtain a single probability.
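The Subtask 2 pipeline can be sketched as follows, assuming the imbalanced-learn implementation of SMOTE and scikit-learn's Random Forest; hyper-parameter values are placeholders, not the paper's settings:

```python
# SMOTE over-sampling of the minority (fear) class, a Random Forest, and
# a decision threshold p tuned on the validation set.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

def train_fear_model(X_train, y_train):
    X_res, y_res = SMOTE().fit_resample(X_train, y_train)  # re-balance classes
    return RandomForestClassifier(n_estimators=100).fit(X_res, y_res)

def predict_fear(clf, X, p=0.5):
    prob_fear = clf.predict_proba(X)[:, 1]          # probability of class "one"
    return (prob_fear > p).astype(int), prob_fear   # hard labels and raw scores
```

In late fusion, the prob_fear outputs of the different models would be averaged before the threshold is applied.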
Table 1: Results of Subtask 1 on test set

        Valence             Arousal
Runs    MSE      r          MSE      r
Run 1   0.2230   -0.0985    0.1577    0.2261
Run 2   0.1670   -0.0990    0.1269   -0.0122
Run 3   0.1833    0.3707    0.1166    0.3213
Run 4   0.2074   -0.0111    0.1318    0.2708
Run 5   0.2046    0.0122    0.1300    0.2750

Table 2: Results of Subtask 2 on test set

Runs    Accuracy   Precision   Recall   F1
Run 1   0.7352     0.0206      0.0530   0.0239
Run 2   0.8153     0.2318      0.2781   0.2352
Run 3   0.8461     0.2035      0.0208   0.0371
Run 4   0.8434     0.0000      0.0000   0.0000
Run 5   0.8469     0.2383      0.2186   0.2165
3 EXPERIMENTS AND RESULTS
In this section, we describe our submitted runs in more detail and present the results. Note that all hyper-parameters were selected according to the results on the validation set, with a 4:1 ratio of training data to validation data.
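A minimal sketch of such a split (scikit-learn is assumed; the paper does not state how the split was implemented):

```python
# Illustrative 4:1 train/validation split of the development data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 88)  # placeholder: 100 segments of 88-d eGeMAPS features
y = np.random.rand(100)      # placeholder: valence (or arousal) annotations
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 4:1 train:validation ratio
```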
3.1 Subtask 1: valence / arousal prediction
We submitted 5 runs for valence / arousal prediction, where the first two use LSTM models and the others use SVR and AdaBoost:
Run 1: For valence, a 2-layer LSTM of hidden size 500 taking eGeMAPS as input; for arousal, a 3-layer LSTM of hidden size 500 taking VGG as input.
Run 2: For valence, late fusion of three 2-layer LSTMs of hidden size 1000 taking eGeMAPS, VGG, and the other visual features as input, respectively; for arousal, the input features are Emobase, eGeMAPS, and CEDD, respectively.
Run 3: For both valence and arousal, an SVR model taking VGG as input.
Run 4: For valence, an AdaBoost model taking eGeMAPS as input; for arousal, an AdaBoost model taking the other visual features as input.
Run 5: For both valence and arousal, late fusion of Run 3 and Run 4.
Here, the “other visual features” in Runs 2 and 4 denote the concatenation of all the provided visual features except the CNN feature. CEDD is the Color and Edge Directivity Descriptor, one of the provided visual features. VGG denotes the CNN features extracted from the fc6 layer of VGG16.
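A minimal sketch of how such fc6 features could be extracted, assuming torchvision's VGG16 (where classifier[0] is the fc6 layer); this is one reading of the description, not the authors' extraction code:

```python
# fc6 activations of torchvision's VGG16 as per-frame features.
# "frame_image" is assumed to be a PIL image of one video frame.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(pretrained=True).eval()
fc6 = nn.Sequential(
    vgg.features,       # convolutional backbone
    vgg.avgpool,        # pool to 7x7
    nn.Flatten(),       # 512 * 7 * 7 = 25088
    vgg.classifier[0],  # fc6: Linear(25088, 4096)
)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    feat = fc6(preprocess(frame_image).unsqueeze(0))  # shape (1, 4096)
```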
From Table 1 we can see that the best run for valence MSE is Run 2, which uses late fusion of LSTM models, while Run 3, which uses an SVR model with the VGG feature, achieves the best results on the other metrics. Notice that Run 2, the LSTM late-fusion method, obtains a better MSE than Run 1, the single LSTM model, which suggests that late fusion of three models exploits the different information in the different features and improves performance to some extent. However, the LSTM models perform worse in Pearson's r than the traditional machine learning models. This could be because the LSTM models tend to predict similar values at all times, which yields a lower MSE but also a lower Pearson's r.
Taken together, Run 3 using SVR and VGG achieves the best results, which suggests that CNN features contain useful information for emotion analysis and that traditional models can behave well when trained properly.

3.2 Subtask 2: fear prediction
We submitted 5 runs for fear prediction, all using Random Forest models:
Run 1: Random Forest + other visual features.
Run 2: Random Forest + VGG.
Run 3: Random Forest + all visual features.
Run 4: Predicting “zero” everywhere (as a reference baseline).
Run 5: Late fusion of Run 1 and Run 2.

From Table 2 we can see that Run 2, using VGG features, achieves the best results on recall and F1, while Run 5, using late fusion, achieves the best results on accuracy and precision. As mentioned before, Subtask 2 is very unbalanced, with far fewer fear samples. It is therefore no surprise that accuracy and precision pattern together while recall and F1 form the other pair: predicting more “zeros” leads to higher accuracy but lower recall, and vice versa.

When considering the F1 score, the harmonic mean of precision and recall, Run 2 using the VGG feature performs best, which is consistent with the finding in Subtask 1 that CNN features contain useful information for emotion analysis.

4 CONCLUSION AND DISCUSSION
In this paper, we describe our approach to the MediaEval 2017 “Emotional Impact of Movies” task. In the valence / arousal prediction subtask, both LSTM and SVR models are trained and compared. In the fear prediction subtask, Random Forest models using different features are compared. In addition, early fusion and late fusion are adopted in the experiments and show promising results in some respects.

However, some problems remain unsolved. For instance, some of the LSTM models tend to predict similar values at all times, leading to a very low Pearson's r, which may be caused by an inappropriate experimental configuration. The imbalance problem in Subtask 2 persists even with the SMOTE algorithm, which means that changing models or features may make little difference, and predicting “zero” everywhere can still achieve a very high accuracy. These problems remain to be addressed in future work.

ACKNOWLEDGMENTS
This work was partially supported by the National High Technology Research and Development Program of China (863 Program) (2015AA016305) and the National Natural Science Foundation of China (61433018, 61171116).
REFERENCES
[1] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. 2015. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 77–83.
[2] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. 2015. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55.
[3] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[4] Emmanuel Dellandréa, Liming Chen, Yoann Baveye, Mats Sjöberg, and Christel Chamaret. 2016. The MediaEval 2016 Emotional Impact of Movies Task. In Proceedings of MediaEval 2016 Workshop. Hilversum, Netherlands.
[5] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Sjöberg. 2017. The MediaEval 2017 Emotional Impact of Movies Task. In Proceedings of MediaEval 2017 Workshop. Dublin, Ireland.
[6] Florian Eyben, Klaus Scherer, Khiet Truong, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, and Petri Laukka. 2016. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing 7, 2 (2016), 190–202.
[7] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 835–838.
[8] Ye Ma, Zipeng Ye, and Mingxing Xu. 2016. THU-HCSI at MediaEval 2016: Emotional Impact of Movies Task. In Proceedings of MediaEval 2016 Workshop. Hilversum, Netherlands.
[9] Mats Sjöberg, Yoann Baveye, Hanli Wang, Vu Lam Quang, Bogdan Ionescu, Emmanuel Dellandréa, Markus Schedl, Claire-Hélène Demarty, and Liming Chen. 2015. The MediaEval 2015 Affective Impact of Movies Task. In Proceedings of MediaEval 2015 Workshop. Wurzen, Germany.