 RUC at MediaEval 2016 Emotional Impact of Movies Task:
             Fusion of Multimodal Features

                                                    Shizhe Chen, Qin Jin
                                School of Information, Renmin University of China, China
                                               {cszhe1, qjin}@ruc.edu.cn


ABSTRACT

In this paper, we present our approaches for the MediaEval 2016 Emotional Impact of Movies Task. We extract features from multiple modalities, including the audio, image and motion modalities. SVR and Random Forest are used as our regression models, and late fusion is applied to combine the different modalities. Experimental results show that multimodal late fusion is beneficial for predicting global affects and continuous arousal, and that using CNN features can further boost the performance. For continuous valence prediction, however, the acoustic features are superior to the other features.

1. INTRODUCTION

The 2016 Emotional Impact of Movies Task [1] involves two subtasks: global and continuous affect prediction. The global subtask requires participants to predict the induced valence and arousal values for short video clips, while in the continuous subtask the affect values must be predicted every second for long movies. In the following sections, we describe our multimodal features, models and experiments in detail.
2. FEATURE EXTRACTION

2.1 Audio Modality

Statistical Acoustic Features: Statistical acoustic features have proven very effective in speech emotion recognition. We use the open-source toolkit OpenSMILE [2] to extract three kinds of features, IS09, IS10 and IS13, which use the configurations of the INTERSPEECH 2009 [3], 2010 [4] and 2013 [5] Paralinguistic challenges respectively. The difference between these feature sets is that the later configurations cover more low-level descriptors and statistical functionals.

MFCC-based Features: The Mel-Frequency Cepstral Coefficients (MFCCs) [6] are the most widely used low-level features. We therefore use MFCCs as our frame-level features and apply two encoding strategies, Bag-of-Audio-Words (BoW) [7] and Fisher Vector encoding (FV) [8], to transform the set of MFCCs into clip-level features. For the mfccBoW features, the acoustic codebook is trained by K-means with 1024 clusters. For the mfccFV features, we train a GMM codebook with 8 mixtures.

In the continuous subtask, the audio features are extracted with a window of 10s and a shift of 1s to cover more context.
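For concreteness, the following is a minimal sketch of the mfccFV encoding described above; librosa and scikit-learn are illustrative stand-ins for the tools actually used, and only the 8-mixture GMM codebook comes from the text.

    # Sketch of the mfccFV pipeline; only the 8-mixture GMM setting comes from
    # the description above, the libraries and MFCC dimensionality are assumptions.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def frame_mfccs(wav_path, n_mfcc=13):
        """Frame-level MFCCs for one clip, shape (n_frames, n_mfcc)."""
        y, sr = librosa.load(wav_path, sr=None)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def fit_gmm_codebook(train_clips_mfccs, n_mixtures=8):
        """Train the GMM codebook on MFCC frames pooled over the training clips."""
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag')
        gmm.fit(np.vstack(train_clips_mfccs))
        return gmm

    def fisher_vector(mfccs, gmm):
        """Encode one clip's MFCC frames into a (power- and L2-normalized)
        Fisher Vector of first- and second-order statistics."""
        q = gmm.predict_proba(mfccs)                    # (T, K) posteriors
        diff = mfccs[:, None, :] - gmm.means_[None]     # (T, K, D)
        sigma = np.sqrt(gmm.covariances_)[None]         # (1, K, D), diagonal case
        g_mu = (q[..., None] * diff / sigma).sum(0)
        g_sigma = (q[..., None] * (diff ** 2 / sigma ** 2 - 1)).sum(0) / np.sqrt(2)
        fv = np.hstack([g_mu.ravel(), g_sigma.ravel()]) / len(mfccs)
        fv = np.sign(fv) * np.sqrt(np.abs(fv))          # power normalization
        return fv / (np.linalg.norm(fv) + 1e-12)        # L2 normalization

For the continuous subtask, the same encoding would be applied to each 10s window sampled every 1s.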
2.2 Image Modality

Hand-crafted Visual Features: We extract the Hue-Saturation Histogram (hsh) to describe the color information and Dense SIFT (DSIFT) features to represent the visual appearance. For the hsh features, we quantize the hue to 30 levels and the saturation to 32 levels. For the DSIFT features, we use Fisher Vector encoding to construct the video-level features, and kernel PCA is then utilized to reduce the dimensionality to 4096.

DCNN Features: To explore the performance of different pre-trained CNN models, we extract multiple layers from several models, including inception-v3 [9], VGG-16 and VGG-19 [10]. All CNN features are mean-pooled over frames to generate video-level representations.
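As an illustration of the hsh descriptor, here is a minimal sketch assuming OpenCV for the color conversion; only the 30x32 hue-saturation quantization comes from the text above, and averaging the per-frame histograms into a video-level descriptor is an assumption of this sketch.

    # Sketch of the Hue-Saturation Histogram (hsh); cv2 is an illustrative choice.
    import cv2
    import numpy as np

    def hue_saturation_histogram(frame_bgr, hue_bins=30, sat_bins=32):
        """L1-normalized joint hue-saturation histogram for one frame."""
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [hue_bins, sat_bins],
                            [0, 180, 0, 256])           # OpenCV hue range is [0, 180)
        hist = hist.flatten()
        return hist / (hist.sum() + 1e-12)

    def video_hsh(frames_bgr):
        """Average the per-frame histograms into one video-level descriptor
        (the averaging step is an assumption of this sketch)."""
        return np.mean([hue_saturation_histogram(f) for f in frames_bgr], axis=0)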
2.3 Motion Modality

To exploit the temporal information in the videos, we extract the improved Dense Trajectory (iDT) [11] and C3D [12] features. For the iDT features, HOG, HOF and MBH descriptors are densely extracted from the video and encoded with Fisher Vectors, and kernel PCA is then used to reduce the dimensionality to 4096. For the C3D features, we extract activations from the penultimate layer for every non-overlapping 16 frames and mean-pool them into one video-level vector.

The challenge also provides baseline features [13] for the global subtask, which consist of acoustic and visual features.
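A minimal sketch of the 16-frame segmentation and mean pooling used for the C3D features; `c3d_penultimate` is a hypothetical callable standing in for the real network's penultimate-layer forward pass.

    # Sketch of aggregating C3D activations into a video-level vector.
    import numpy as np

    def c3d_video_feature(frames, c3d_penultimate, clip_len=16):
        """Split the video into non-overlapping 16-frame clips, extract the
        penultimate-layer activations for each clip, and mean-pool them."""
        feats = []
        for start in range(0, len(frames) - clip_len + 1, clip_len):
            clip = frames[start:start + clip_len]      # one 16-frame clip
            feats.append(c3d_penultimate(clip))        # hypothetical extractor call
        return np.mean(feats, axis=0)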
3. EXPERIMENTS

3.1 Experimental Setting

In the global subtask, there are 9,800 video clips from 160 movies in the development set. We randomly select 6,093, 1,761 and 1,946 videos as our local training, validation and testing sets respectively, keeping clips from the same movie in the same set. In the continuous subtask, the 30 movies in the development set are likewise divided into 24 movies for training, 3 for validation and 3 for testing.

We train an SVR and a Random Forest for each kind of feature and use grid search to select the best hyper-parameters. For SVR, we explore linear and RBF kernels and tune the cost from 2^-5 to 2^12 and the epsilon-tube from 0.1 to 0.4. For Random Forest, the number of trees and the depth of the trees are tuned from 100 to 1000 and from 3 to 20 respectively. We then apply late fusion over different features by training a second-layer model (a linear SVR) on the best predictions of each feature type on the local validation set, and we use the Sequential Backward Selection algorithm to find the best subset of feature types for late fusion, as sketched below.
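The following is a minimal sketch of the per-feature grid search, the second-layer late fusion, and the backward selection, with scikit-learn as an illustrative implementation; the hyper-parameter ranges mirror those listed above, whereas the paper selects them on the local validation split rather than by cross-validation.

    # Illustrative sketch (scikit-learn); the hyper-parameter grids follow the
    # ranges in Section 3.1, everything else is an assumption of this sketch.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    def train_unimodal_svr(X_train, y_train):
        """Grid-search an SVR for one feature type over the ranges listed above."""
        grid = {
            'kernel': ['linear', 'rbf'],
            'C': [2.0 ** p for p in range(-5, 13)],   # cost from 2^-5 to 2^12
            'epsilon': [0.1, 0.2, 0.3, 0.4],          # epsilon-tube
        }
        # The paper tunes on a held-out validation set; cross-validation is a stand-in.
        search = GridSearchCV(SVR(), grid, scoring='neg_mean_squared_error', cv=3)
        search.fit(X_train, y_train)
        return search.best_estimator_

    def train_late_fusion(valid_preds, y_valid):
        """Second-layer linear SVR over the stacked unimodal predictions
        (one column per feature type, computed on the local validation set)."""
        fuser = SVR(kernel='linear')
        fuser.fit(np.column_stack(valid_preds), y_valid)
        return fuser

    def sequential_backward_selection(n_types, score_fn):
        """Greedy SBS: repeatedly drop the feature type whose removal most
        improves the fused validation score; stop when no removal helps.
        `score_fn(indices)` fuses the selected prediction columns and returns
        a score where higher is better."""
        selected = list(range(n_types))
        best = score_fn(selected)
        while len(selected) > 1:
            scores = {i: score_fn([j for j in selected if j != i]) for i in selected}
            drop, s = max(scores.items(), key=lambda kv: kv[1])
            if s <= best:
                break
            best, selected = s, [j for j in selected if j != drop]
        return selected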


[Figure 1: MSE Performance of Different Features for Global Arousal Prediction on Local Testing Set]

[Figure 2: MSE Performance of Different Features for Global Valence Prediction on Local Testing Set]

3.2 Global Affects Prediction

In the global subtask, we use the mean squared error (MSE) as the evaluation metric. Figure 1 presents the MSE of the different features for arousal prediction. The audio modality performs the best. Since the baseline feature contains multimodal cues, it achieves the second best performance, following our mfccBoW feature. Run1 is the late fusion of all the audio features, the baseline features and the iDT features. The run2 system additionally uses the c3d_fc6, vgg16_fc7 and vgg19_fc6 features in the late fusion. The arousal prediction performance is significantly improved by the multimodal late fusion.

The MSE of the different features for global valence prediction is shown in Figure 2. The image modality features, especially the CNN features, are better than the other modalities for valence prediction. The run1 system consists of the baseline, IS10, mfccBoW, mfccFV and hsh features. The run2 system additionally uses c3d_fc6, c3d_prob, vgg16_fc6 and vgg16_fc7. Although on our local testing set the late fusion does not outperform the best unimodal performance obtained with the CNN vgg16 features, it might be more robust than relying on a single feature.
3.3 Continuous Affects Prediction

In the continuous subtask, we use the Pearson Correlation Coefficient (PCC) for performance evaluation instead of MSE, because the labels in the continuous subtask have closer temporal connections than those in the global subtask, so the shape of the prediction curve is more important. Since the testing set is relatively small and the performance is quite unstable, we average the performance over the validation and testing sets. Figure 3 shows the PCC results of the different features. The mfccFV feature performs the best for both arousal and valence prediction. The settings of the three submitted runs are as follows. In run1, we apply late fusion over mfccFV, IS09 and IS10 for arousal and use the mfccFV SVR for valence. In run2, mfccFV, IS09, IS10 and inc_fc are late fused for arousal, and mfccFV and IS09 are late fused for valence. Run3 late fuses mfccFV, IS09 and inc_fc for arousal and uses the mfccFV Random Forest for valence. In our experiments, late fusion is beneficial for arousal prediction but not for valence prediction.

[Figure 3: PCC Performance of Different Features for Continuous Prediction on Local Testing Set]
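For reference, a minimal sketch of the two evaluation metrics as used here, with numpy and scipy as illustrative tools.

    # Sketch of the evaluation metrics; numpy/scipy are illustrative choices.
    import numpy as np
    from scipy.stats import pearsonr

    def mse(y_true, y_pred):
        """Mean squared error (global subtask metric)."""
        return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

    def pcc(y_true, y_pred):
        """Pearson correlation between the predicted and ground-truth
        per-second curves of one movie (continuous subtask metric)."""
        return float(pearsonr(y_true, y_pred)[0])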
Table 1: The Submission Results for Global and Continuous Affects Prediction

                             Arousal             Valence
                          MSE      PCC        MSE      PCC
    Global      run1     1.479    0.405      0.218    0.312
                run2     1.510    0.467      0.201    0.419
    Continuous  run1     0.120    0.147      0.102    0.106
                run2     0.121    0.236      0.108    0.132
                run3     0.122    0.191      0.099    0.142

3.4 Submitted Runs

In Table 1, we list our results on the challenge testing set. For the global subtask, comparing run1 and run2 shows that fusing CNN features can greatly improve the arousal and valence prediction performance. For the continuous subtask, the fusion of image and audio cues improves the arousal prediction performance, but for valence prediction the mfccFV feature alone achieves the best results.

4. CONCLUSIONS

In this paper, we presented our multimodal approach to predicting global and continuous affects. The best result on the global subtask is achieved by the late fusion of the audio, image and motion modalities. For the continuous subtask, however, the mfccFV feature significantly outperforms the other features and benefits little from late fusion on valence prediction. In future work, we will explore more features for continuous affect prediction and use LSTMs to model the temporal structure of the videos.

5. ACKNOWLEDGMENTS

This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202.

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
6. REFERENCES

[1] Emmanuel Dellandréa, Liming Chen, Yoann Baveye, Mats Sjöberg, and Christel Chamaret. The MediaEval 2016 emotional impact of movies task. In MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: the Munich versatile and fast open-source audio feature extractor. In ACM International Conference on Multimedia (MM), pages 1459–1462, 2010.
[3] Björn W. Schuller, Stefan Steidl, and Anton Batliner. The INTERSPEECH 2009 emotion challenge. In INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009, pages 312–315, 2009.
[4] Björn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, 53(9-10):1062–1087, 2011.
[5] Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, and Erik Marchi. The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of Interspeech, pages 148–152, 2013.
[6] Steven B. Davis. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Readings in Speech Recognition, 28(4):65–74, 1990.
[7] Stephanie Pancoast and Murat Akbacak. Softening quantization in bag-of-audio-words. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1370–1374, 2014.
[8] Jorge Sanchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
[9] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[10] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[11] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[12] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
[13] Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In ACII, pages 77–83, 2015.