THU-HCSI at MediaEval 2016: Emotional Impact of Movies Task

Ye Ma, Zipeng Ye, Mingxing Xu
Key Laboratory of Pervasive Computing, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology (TNList)
Department of Computer Science and Technology, Tsinghua University, Beijing, China
{y-ma13, yezp13}@mails.tsinghua.edu.cn, xumx@tsinghua.edu.cn

ABSTRACT
In this paper we describe our team's approach to the MediaEval 2016 Challenge "Emotional Impact of Movies". In addition to the baseline features, we extract audio features and image features from the video clips: a Convolutional Neural Network (CNN) is used for the image features and the openSMILE toolbox for the audio features. For the continuous prediction task we also study a multi-scale approach, using Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BLSTM) models. Fusion methods are also considered and discussed. The evaluation results show the effectiveness of our approaches.

1. INTRODUCTION
The MediaEval 2016 Challenge "Emotional Impact of Movies" consists of two subtasks: global emotion prediction for a short video clip (around 10 seconds) and continuous emotion prediction for a complete movie. The LIRIS-ACCEDE dataset [2, 1] is used in the challenge. A brief introduction to the training and test data, as well as the details of the two subtasks, is given in [3]. In this paper, we mainly discuss the approach employed by our system.

2. APPROACH

2.1 Subtask 1: global emotion prediction

2.1.1 Feature Extraction
Apart from the baseline features provided by the organizers, two types of features are used in our experiments: audio features and image features. The audio features only use the audio wave files extracted from the video files, and the image features only use the static frames extracted from the videos.

For the audio features, we use the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), which consists of 88 features and has been used in many emotion recognition tasks because of its potential and theoretical significance [4]. In our experiments, we extract these features from each video clip with the openSMILE toolkit [5].

The image features were extracted with a fine-tuned Convolutional Neural Network (CNN). We adopted the 19-layer CNN model pretrained on the ILSVRC-2014 dataset by the VGG team [10] and replaced the last softmax layer (used for classification) with a fully-connected layer and a Euclidean loss layer (used for regression). The input images are static frames extracted from the video clips at a rate of 2 Hz, and the output labels are the valence and arousal annotations. We trained the valence and arousal CNNs separately with Caffe [8] and used the first fully-connected layer of the CNN models as the output features, which were then reduced with Principal Component Analysis (PCA) in our experiments.

2.1.2 Prediction Models
Support Vector Regression (SVR) models with an RBF kernel were trained for valence and arousal separately. We tried both early fusion and late fusion of the audio and visual features, which is elaborated in Section 3.
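As an illustration, the early-fusion variant of this setup could look roughly as follows. This is a minimal sketch assuming scikit-learn and NumPy; the feature arrays, the baseline feature dimensionality and the random labels are placeholders, not our actual pipeline.

    # Sketch of the Subtask 1 early-fusion pipeline (illustrative only):
    # PCA-compressed CNN features are concatenated with the audio and
    # baseline features and fed to an RBF-kernel SVR, one regressor per
    # target dimension (valence shown here; arousal is analogous).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    n_clips = 1000
    cnn_feats = rng.standard_normal((n_clips, 4096))     # first fully-connected layer of the CNN
    audio_feats = rng.standard_normal((n_clips, 88))     # eGeMAPS features from openSMILE
    baseline_feats = rng.standard_normal((n_clips, 64))  # organizer-provided features (dim. is a placeholder)
    valence = rng.uniform(-1.0, 1.0, n_clips)            # placeholder annotations

    # Compress the CNN features with PCA (128 dims for valence, 512 for arousal in our runs).
    cnn_reduced = PCA(n_components=128).fit_transform(cnn_feats)

    # Early fusion: concatenate all modalities into one feature vector per clip.
    fused = np.hstack([baseline_feats, audio_feats, cnn_reduced])

    # Train an RBF-kernel SVR on the fused features and predict.
    svr_valence = SVR(kernel="rbf").fit(fused, valence)
    valence_pred = svr_valence.predict(fused)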
2.2 Subtask 2: continuous emotion prediction

2.2.1 Feature Extraction
For the audio features, we used the feature set of the INTERSPEECH 2013 Computational Paralinguistics Challenge [11], which consists of 130 dimensions. For the image features, the same CNN features as in Subtask 1 are chosen and reduced by PCA to 256 dimensions.

2.2.2 Prediction Model
We applied Long Short-Term Memory (LSTM) networks [7] to model the context information in movies. Since the emotion evoked by a video clip is associated not only with the previous content but also with the future content, Bidirectional Long Short-Term Memory (BLSTM) [6] is considered a better choice because of its ability to use both previous and future information.

In our experiments, two types of models with three layers were used. Type 1 has three LSTM layers, and type 2 is the same except that the middle layer is a BLSTM. The dimensions of the two hidden layers are listed in Table 1.

Table 1: Dimensions of the two hidden layers
Feature         Model    Layer 1   Layer 2
Audio           Type 1   128       64
Audio           Type 2   128       32
Audio + Image   Type 1   256       128
Audio + Image   Type 2   256       64

2.2.3 Multi-scale Fusion and Post-processing
Similar to [9], five models of different scales were trained with different sequence lengths, i.e., 8, 16, 32, 64 and 128, respectively. For each scale, we selected one appropriate model out of 3 trials. We divided the whole dataset into three parts: 70% for training, 20% for validation and 10% as the test set for fusion and post-processing.

Finally, we applied a post-processing step with a sliding triangular filter to smooth the final results. In our experiments, the filter window size is 9.
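The smoothing step can be illustrated with a short sketch, assuming NumPy; the window size of 9 follows our experiments, while the prediction sequence itself is a placeholder.

    # Sketch of the post-processing: smooth a prediction sequence with a
    # sliding triangular filter of window size 9.
    import numpy as np

    def triangular_smooth(predictions, window_size=9):
        # Triangular weights, e.g. [1, 2, 3, 4, 5, 4, 3, 2, 1] for window_size = 9.
        half = (window_size + 1) // 2
        weights = np.concatenate([np.arange(1, half + 1),
                                  np.arange(half - 1, 0, -1)]).astype(float)
        weights /= weights.sum()
        # Pad with edge values so the smoothed sequence keeps its original length.
        padded = np.pad(predictions, half - 1, mode="edge")
        return np.convolve(padded, weights, mode="valid")

    # Placeholder: a per-timestep valence prediction sequence for one movie.
    valence_pred = np.random.randn(600)
    valence_smoothed = triangular_smooth(valence_pred, window_size=9)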
3. EXPERIMENTS AND RESULTS
In this section, we describe our methods and experiments in more detail and present the results.

3.1 Subtask 1: global emotion prediction
We submitted three runs for the global prediction task, listed below:
Run 1: (baseline + eGeMAPS + CNN) features + SVR + early fusion
Run 2: (baseline + eGeMAPS) features + SVR + early fusion
Run 3: (baseline + eGeMAPS + CNN) features + SVR + late fusion

In detail, the CNN features in Run 1 and Run 3 are compressed with PCA, to 512 dimensions for arousal and 128 dimensions for valence. These dimensions were chosen based on 5-fold cross-validation on the training set. The weight of the late fusion in Run 3 was also determined on the validation data.

Table 2: Global prediction results on the test set
           Valence              Arousal
Runs    MSE      r          MSE      r
Run 1   0.2188   0.2680     1.4674   0.2725
Run 2   0.2170   0.2740     1.5910   0.3444
Run 3   0.2140   0.2955     1.5312   0.2667

From Table 2 we can see that the best run for valence is Run 3 while the best for arousal is Run 1, which use late fusion and early fusion respectively. Note that the runs using CNN features perform better on arousal than those that do not, indicating that image features may contain more information about the emotion's polarity than the audio ones. It is also worth mentioning that the arousal Pearson r of Run 2 is the highest among all runs, implying that a higher correlation may come with a higher MSE loss to some extent.

3.2 Subtask 2: continuous emotion prediction
In order to select the best model for each scale and for the fusion, we designed a series of experiments. We have five different scales, two feature sets and two types of models, and for each possible combination we trained 3 trials with randomized initial weights, giving 60 (5 x 2 x 2 x 3) experiments in total. We submitted five runs, whose results on the test set are shown in Table 3.

Table 3: Continuous prediction results on the test set
           Valence              Arousal
Runs    MSE      r          MSE      r
Run 1   0.1086   0.017      0.1601   0.054
Run 2   0.1276   -0.023     0.1244   -0.023
Run 3   0.1016   -0.002     0.1354   0.030
Run 4   0.1029   -0.003     0.1294   0.026
Run 5   0.1018   0.000      0.1376   0.052

Run 1: Only audio features were used. The sequence length of the LSTM was 16 for valence and 64 for arousal. The model was type 2 for valence and type 1 for arousal.
Run 2: The audio and video feature vectors were concatenated into a multi-modality feature vector. The scale was 16 for valence and 128 for arousal. The model was type 2 for valence and type 1 for arousal.
Run 3: We used the same features as Run 1. Multi-scale models were trained and fused by simple averaging to generate the final results.
Run 4: We used the same features as Run 2. Multi-scale models were trained and fused by simple averaging to generate the final results.
Run 5: Same as Run 3 except for the fusion, in which we weighted the models' results with different weights, i.e., 0.4, 0.3, 0.2 and 0.1 from the lowest loss to the highest, respectively, as sketched below.
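The weighted fusion of Run 5 could be implemented roughly as follows. This is a sketch assuming NumPy; the predictions and validation losses are placeholders, and since the run description lists four weights, the sketch keeps only the four lowest-loss models (whether all five scales entered the fusion is not stated above).

    # Sketch of the Run 5 fusion: rank the multi-scale models by their
    # validation loss and combine their predictions with fixed weights
    # (0.4, 0.3, 0.2, 0.1), giving more weight to lower-loss models.
    import numpy as np

    def weighted_fusion(predictions, val_losses, weights=(0.4, 0.3, 0.2, 0.1)):
        # predictions: array of shape (n_models, sequence_length)
        # val_losses:  validation loss of each model, used only for ranking
        order = np.argsort(val_losses)[:len(weights)]   # indices of the lowest-loss models
        return np.average(predictions[order], axis=0, weights=np.asarray(weights))

    # Placeholders: predictions from models trained at the five scales.
    preds = np.random.randn(5, 600)                     # e.g. scales 8, 16, 32, 64, 128
    losses = np.array([0.11, 0.10, 0.13, 0.12, 0.14])   # per-model validation losses
    fused = weighted_fusion(preds, losses)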
4. CONCLUSION
In this paper, we described our approach to the MediaEval 2016 Challenge "Emotional Impact of Movies" task. For the global emotion prediction subtask, combining in the features learnt from video by the CNN improves the regression performance for arousal with early fusion, as well as the performance for valence with late fusion.

For continuous prediction, the best results obtained in this paper are Run 3 for valence and Run 2 for arousal. Multi-scale fusion performs well for valence. For arousal, Run 3 is better than Run 1 but Run 4 is worse than Run 2, so fusion does not always improve the results.

5. ACKNOWLEDGMENTS
This work was partially supported by the 863 Program of China (2015AA016305), the National Natural Science Foundation of China (61171116, 61433018) and the Major Project of the National Social Science Foundation of China (13&ZD189).

6. REFERENCES
[1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on, pages 77-83. IEEE, 2015.
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43-55, 2015.
[3] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and C. Chamaret. The MediaEval 2016 emotional impact of movies task. In Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, 2016.
[4] F. Eyben, K. Scherer, K. Truong, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, and P. Laukka. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190-202, 2016.
[5] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, pages 835-838. ACM, 2013.
[6] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602-610, 2005.
[7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[9] X. Li, J. Tian, M. Xu, Y. Ning, and L. Cai. DBLSTM-based multi-scale fusion for dynamic emotion prediction in music. pages 1-6, 2016.
[10] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[11] F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer. On the acoustics of emotion in audio: What speech, music, and sound have in common. Frontiers in Psychology, 4:292, 2013.