RUC at MediaEval 2016 Emotional Impact of Movies Task: Fusion of Multimodal Features

Shizhe Chen, Qin Jin
School of Information, Renmin University of China, China
{cszhe1, qjin}@ruc.edu.cn

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
In this paper, we present our approaches for the MediaEval Emotional Impact of Movies Task. We extract features from multiple modalities, including the audio, image and motion modalities. SVR and Random Forest are used as our regression models, and late fusion is applied to combine the different modalities. Experimental results show that multimodal late fusion is beneficial for predicting global affects and continuous arousal, and that using CNN features can further boost the performance. For continuous valence prediction, however, the acoustic features are superior to the other features.

1. INTRODUCTION
The 2016 Emotional Impact of Movies Task [1] involves two subtasks: global and continuous affect prediction. The global subtask requires participants to predict the induced valence and arousal values for short video clips, while in the continuous subtask the affect values should be predicted continuously, every second, for long movies. In the following sections, we describe the multimodal features, models and experiments in detail.

2. FEATURE EXTRACTION

2.1 Audio Modality
Statistical Acoustic Features: Statistical acoustic features have proved to be very effective in speech emotion recognition. We use the open-source toolkit OpenSMILE [2] to extract three kinds of features, IS09, IS10 and IS13, which use the configurations of the INTERSPEECH 2009 [3], 2010 [4] and 2013 [5] Paralinguistic challenges respectively. The difference between these feature sets is that the later ones cover more low-level descriptors and statistical functionals.
MFCC-based Features: The Mel-Frequency Cepstral Coefficients (MFCCs) [6] are the most widely used low-level features. Therefore, we use MFCCs as our frame-level feature and apply two encoding strategies, Bag-of-Audio-Words (BoW) [7] and Fisher Vector encoding (FV) [8], to transform the set of MFCCs into sentence-level features. For the mfccBoW features, the acoustic codebook is trained by K-means with 1024 clusters. For the mfccFV features, we train the codebook as a GMM with 8 mixtures.
In the continuous subtask, the audio features are extracted with a window of 10s and a shift of 1s to cover more context.
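To make the MFCC encoding step concrete, the following is a minimal sketch of how frame-level MFCCs could be aggregated into clip-level mfccBoW and mfccFV vectors. It assumes librosa for MFCC extraction and scikit-learn for K-means and the GMM; these tools, the helper names and the simplified Fisher Vector (mean and variance gradients only) are illustrative assumptions, not the exact pipeline used for our submission.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, n_mfcc=13, sr=16000):
    """Frame-level MFCCs, shape (n_frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_codebooks(train_mfccs, n_words=1024, n_mix=8):
    """Train the BoW codebook (K-means, 1024 clusters) and the FV codebook (GMM, 8 mixtures)."""
    stacked = np.vstack(train_mfccs)  # all frames from all training clips
    kmeans = KMeans(n_clusters=n_words, n_init=4).fit(stacked)
    gmm = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(stacked)
    return kmeans, gmm

def mfcc_bow(mfcc, kmeans):
    """Bag-of-Audio-Words: histogram of codeword assignments, L1-normalized."""
    words = kmeans.predict(mfcc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def mfcc_fv(mfcc, gmm):
    """Simplified Fisher Vector: gradients w.r.t. the GMM means and variances."""
    q = gmm.predict_proba(mfcc)                      # (T, K) posteriors
    T, d = mfcc.shape
    diff = (mfcc[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_[None])
    g_mu = (q[..., None] * diff).sum(0) / (T * np.sqrt(gmm.weights_)[:, None])
    g_sig = (q[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * gmm.weights_)[:, None])
    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalization
    return fv / max(np.linalg.norm(fv), 1e-12)       # L2 normalization
```

For the continuous subtask, the same functions would simply be applied to the MFCC frames falling inside each 10s window, advanced with a 1s shift.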
2.2 Image Modality
Hand-crafted Visual Features: We extract the Hue-Saturation Histogram (hsh) to describe the color information and Dense SIFT (DSIFT) features to represent the visual appearance. For the hsh features, we quantize the hue to 30 levels and the saturation to 32 levels. For the DSIFT features, we use Fisher Vector encoding to construct the video-level features, and kernel PCA is then applied to reduce the dimensionality to 4096.
DCNN Features: To explore the performance of different pre-trained CNN models, we extract multiple layers from several CNN models, including Inception-v3 [9], VGG-16 and VGG-19 [10]. All the CNN features are mean-pooled over frames to generate video-level representations.

2.3 Motion Modality
To exploit the temporal information in the video, we extract the improved Dense Trajectory (iDT) [11] and C3D [12] features. For the iDT features, HOG, HOF and MBH descriptors are densely extracted from the video and encoded with Fisher Vector; kernel PCA is then used to reduce the dimensionality to 4096. For the C3D features, we extract activations from the penultimate layer for every non-overlapping block of 16 frames and use mean pooling to aggregate them into one vector.
The challenge also provides baseline features [13] for the global subtask, which consist of acoustic and visual features.

3. EXPERIMENTS

3.1 Experimental Setting
In the global subtask, there are 9,800 video clips from 160 movies in the development set. We randomly select 6093, 1761 and 1946 videos as our local training, validation and testing sets respectively, keeping video clips from the same movie in the same set. In the continuous subtask, the 30 movies in the development set are also divided into 3 parts, with 24 movies for training, 3 for validation and 3 for testing.
We train SVR and Random Forest models for each kind of feature and use grid search to select the best hyper-parameters. For SVR, we explore linear and RBF kernels and tune the cost from 2⁻⁵ to 2¹² and the epsilon-tube from 0.1 to 0.4. For Random Forest, the number of trees and the depth of trees are tuned from 100 to 1000 and from 3 to 20 respectively. We apply late fusion by training a second-layer model (a linear SVR) whose inputs are the best predictions of each kind of feature on the local validation set, and we use the Sequential Backward Selection algorithm to find the best subset of feature types for late fusion. A sketch of this procedure is given below.
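The sketch below illustrates the per-feature model selection and the second-layer late fusion. It assumes scikit-learn's GridSearchCV, SVR, RandomForestRegressor and LinearSVR; only the search ranges follow the text above, while the sampled grid values, variable names and cross-validation setup are assumptions for illustration (the Sequential Backward Selection over feature types is omitted for brevity).

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def fit_best_regressor(X_train, y_train):
    """Grid-search SVR and Random Forest for one feature type; return the best model."""
    svr_grid = {
        'kernel': ['linear', 'rbf'],
        'C': [2.0 ** p for p in range(-5, 13)],   # cost from 2^-5 to 2^12
        'epsilon': [0.1, 0.2, 0.3, 0.4],          # epsilon-tube from 0.1 to 0.4
    }
    rf_grid = {
        'n_estimators': [100, 300, 500, 1000],    # trees: sampled grid in 100-1000
        'max_depth': [3, 5, 10, 20],              # depth: sampled grid in 3-20
    }
    candidates = [
        GridSearchCV(SVR(), svr_grid, scoring='neg_mean_squared_error', cv=3),
        GridSearchCV(RandomForestRegressor(), rf_grid,
                     scoring='neg_mean_squared_error', cv=3),
    ]
    for c in candidates:
        c.fit(X_train, y_train)
    return max(candidates, key=lambda c: c.best_score_).best_estimator_

def late_fusion(models, feats_val, y_val, feats_test):
    """Second-layer linear SVR trained on the per-feature validation predictions."""
    # Stack each unimodal model's predictions as columns of the fusion input.
    P_val = np.column_stack([m.predict(X) for m, X in zip(models, feats_val)])
    P_test = np.column_stack([m.predict(X) for m, X in zip(models, feats_test)])
    fuser = LinearSVR().fit(P_val, y_val)
    return fuser.predict(P_test)
```

In this scheme, `models` holds one tuned regressor per feature type, and the fusion model sees only their predictions, never the raw features, which keeps the second layer small and easy to retrain for different feature subsets.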
3.2 Global Affects Prediction
In the global subtask, we use the mean squared error (MSE) as the evaluation metric. Figure 1 presents the MSE of the different features for arousal prediction. The audio modality performs the best. Since the baseline feature contains multimodal cues, it achieves the second best performance, following our mfccBoW feature. The run1 system is the late fusion of all the audio features, the baseline features and the iDT features. The run2 system additionally includes the c3d_fc6, vgg16_fc7 and vgg19_fc6 features in the late fusion. The arousal prediction performance is significantly improved by the multimodal late fusion.

[Figure 1: MSE Performance of Different Features for Global Arousal Prediction on Local Testing Set]

The MSE of the different features for global valence prediction is shown in Figure 2. The image modality features, especially the CNN features, are better than the other modalities for valence prediction. The run1 system consists of the baseline, IS10, mfccBoW, mfccFV and hsh features. The run2 system additionally uses c3d_fc6, c3d_prob, vgg16_fc6 and vgg16_fc7 on top of the features in run1. Although the late fusion does not outperform the unimodal vgg16 CNN features on our local testing set, it might be more robust than using a single feature.

[Figure 2: MSE Performance of Different Features for Global Valence Prediction on Local Testing Set]

3.3 Continuous Affects Prediction
In the continuous subtask, we use the Pearson Correlation Coefficient (PCC) for performance evaluation instead of MSE, because the labels in the continuous subtask have closer temporal connections than those in the global subtask, so the shape of the prediction curve is more important. Since the testing set is relatively small and the performance on it is quite unstable, we average the performance over the validation and testing sets. Figure 3 shows the PCC results of the different features. The mfccFV feature performs the best for both arousal and valence prediction. The settings for the three submitted runs are as follows. In run1, we apply late fusion over mfccFV, IS09 and IS10 for arousal and use the mfccFV SVR for valence. In run2, mfccFV, IS09, IS10 and inc_fc are late fused for arousal, and mfccFV and IS09 are late fused for valence. Run3 late fuses mfccFV, IS09 and inc_fc for arousal and uses the mfccFV Random Forest for valence. In our experiments, late fusion is beneficial for arousal prediction but not for valence prediction.

[Figure 3: PCC Performance of Different Features for Continuous Prediction on Local Testing Set]

Table 1: The Submission Results for Global and Continuous Affects Prediction

                          Arousal            Valence
                        MSE     PCC        MSE     PCC
  Global      run1     1.479   0.405      0.218   0.312
              run2     1.510   0.467      0.201   0.419
  Continuous  run1     0.120   0.147      0.102   0.106
              run2     0.121   0.236      0.108   0.132
              run3     0.122   0.191      0.099   0.142

3.4 Submitted Runs
In Table 1, we list our results on the challenge testing set. For the global subtask, comparing run1 and run2 shows that fusing CNN features greatly improves both the arousal and the valence prediction performance. For the continuous subtask, the fusion of image and audio cues improves the arousal prediction performance, but for valence prediction the mfccFV feature alone achieves the best results.
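For completeness, here is a minimal sketch of the two evaluation metrics (MSE for the global subtask, PCC for the continuous subtask) and of the averaging over the local validation and testing sets described in Section 3.3. The use of numpy and scipy and the helper names are assumptions for illustration, not tooling named in the paper.

```python
import numpy as np
from scipy.stats import pearsonr

def mse(y_true, y_pred):
    """Mean squared error, used for the global subtask."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def pcc(y_true, y_pred):
    """Pearson correlation coefficient, used for the continuous subtask."""
    return float(pearsonr(y_true, y_pred)[0])

def averaged_score(metric, val_pair, test_pair):
    """Average a metric over the local validation and testing sets,
    mirroring the averaging described in Section 3.3."""
    return 0.5 * (metric(*val_pair) + metric(*test_pair))
```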
4. CONCLUSIONS
In this paper, we present a multimodal approach to predicting global and continuous affects. The best result on the global subtask is achieved by the late fusion of the audio, image and motion modalities. However, for the continuous subtask, the mfccFV feature significantly outperforms the other features and benefits little from late fusion for valence prediction. In future work, we will explore more features for continuous affect prediction and use LSTMs to model the temporal structure of the videos.

5. ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202.

6. REFERENCES
[1] Emmanuel Dellandréa, Liming Chen, Yoann Baveye, Mats Sjöberg, and Christel Chamaret. The MediaEval 2016 emotional impact of movies task. In MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] Florian Eyben, Martin Wöllmer, and Björn Schuller. OpenSMILE: the Munich versatile and fast open-source audio feature extractor. In ACM International Conference on Multimedia (MM), pages 1459–1462, 2010.
[3] Björn W. Schuller, Stefan Steidl, and Anton Batliner. The INTERSPEECH 2009 emotion challenge. In INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009, pages 312–315, 2009.
[4] Björn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, 53(9-10):1062–1087, 2011.
[5] Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, and Erik Marchi. The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of INTERSPEECH, pages 148–152, 2013.
[6] Steven B. Davis. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Readings in Speech Recognition, 28(4):65–74, 1990.
[7] Stephanie Pancoast and Murat Akbacak. Softening quantization in bag-of-audio-words. In ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1370–1374, 2014.
[8] Jorge Sanchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
[9] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[10] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[11] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[12] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
[13] Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In ACII, pages 77–83, 2015.