BOUN-NKU in MediaEval 2017 Emotional Impact of Movies Task

Nihan Karslioglu1, Yasemin Timar1, Albert Ali Salah1, Heysem Kaya2
1Boğaziçi University, İstanbul, Turkey, {nihan.karslioglu, yasemin.timar, salah}@boun.edu.tr
2Namık Kemal University, Tekirdağ, Turkey, hkaya@nku.edu.tr

Copyright is held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT

In this paper, we present our approach for the Emotional Impact of Movies task of the MediaEval 2017 Challenge, which involves multimodal fusion for predicting the arousal and valence of movie clips. Our system consists of two pipelines. In the first, we extracted audio and visual features and used a combination of PCA, Fisher vector encoding, feature selection, and extreme learning machine regressors. In the second, we focused on the regression models rather than on feature selection.

1 INTRODUCTION

The challenge we tackle in this paper is the prediction of the affective content of video clips, denoted by valence and arousal scores. We used well-known regression models in the audio-visual domain for this purpose. The feature sets extracted by the organizers were used to form a baseline system, in order to understand the properties and relations of the most important features for this task. The description of the task is provided in [1].

One of the proposed subtasks is the prediction of "fear", which is represented by a binary value in the ground truth. However, the sections annotated with fear are rare (only 5%), which calls for classifiers capable of dealing with class imbalance (e.g., gradient boosting classifiers). We have not worked on this part of the challenge.

The Emotional Impact of Movies task has been included in the MediaEval challenges since 2015. Various approaches have been studied for the problem in terms of features and regression models in recent years [2]. Audio features, visual descriptors, and deep learning based features have been popular among the participants of the 2016 challenge [3].

2 FIRST APPROACH

Our first pipeline, given in Figure 1, extracts a number of features, reduces their dimensionality with PCA, summarizes them with Fisher vector encoding, and applies a feature selection stage prior to regression.

Figure 1: First system for valence-arousal prediction.

As audio features, we computed Mel-frequency cepstral coefficients (MFCC 0-12) from 32 ms windows with 50% overlap. First and second derivatives were added, resulting in a 39-dimensional feature vector.

We used three types of visual features in addition to these audio features. The Hue Saturation Histogram (HSH) feature is a 1023-dimensional histogram of color pixels, with 33 hue and 31 saturation levels. It was sampled at one frame per second, with frames resized to 240x320. For the Dense SIFT feature, the frames were further resized to 120x160, and Dense SIFT descriptors [4] were extracted at scales {4, 6, 8}, at 7-pixel intervals, once every 30 frames of video. Finally, we used the VGG FC6 feature provided by the organizers, extracted from a deep neural network trained for image recognition.

After reducing the dimensionality of the features by 50% via PCA, we encoded them with Fisher vectors (FV) [5], which measure how much the features deviate from a background probability model, in this case a mixture of Gaussians. The number of clusters was set to 32 for Dense SIFT and MFCC, and a single Gaussian was used for HSH and VGG-FC6. We normalized the resulting vectors with the signed square root followed by L2 normalization.

A ranking-based feature selection approach was then applied, using the Random Sample versus Labels Canonical Correlation Analysis Filter (SLCCA-Rand) method [6]. The main idea is to apply CCA between the features and the target labels, and to sort the absolute values of the projection weights to obtain a ranking. The top-ranked features that sum up to 99% of the total weight are selected for each modality.

For regression, Extreme Learning Machines (ELM) were applied to both the arousal and valence prediction tasks [7]. Grid search was used to find the best ELM parameters: the regularization coefficient was searched in the range [0.01, 1000] with exponential steps, and both radial basis function (RBF) and linear kernels were tested, with the RBF kernel scale parameter optimized in the same range, also with exponential steps. The Pearson correlation coefficient (PCC) was taken as the performance measure and optimized over 5-fold cross-validation on the development partition. The results in Table 1 were obtained on the test set, for which the ground truth was sequestered.
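As an illustration of the audio front-end described above, the following is a minimal sketch assuming librosa; the 16 kHz sampling rate is our assumption, while the window size, overlap, and 39-dimensional output follow the text.

```python
# Hedged sketch of the MFCC front-end: 32 ms windows, 50% overlap,
# MFCC 0-12 plus first and second derivatives (39 dims per frame).
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000):   # sampling rate is an assumption
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.032 * sr)              # 32 ms analysis window
    hop = n_fft // 2                     # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)             # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)    # second derivative
    return np.vstack([mfcc, d1, d2]).T           # shape: (frames, 39)
```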
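The encoding stage can be sketched as follows, assuming scikit-learn; the Fisher vector keeps the standard mean and variance gradients of a diagonal-covariance GMM, with the signed square root and L2 normalization applied as described above.

```python
# Hedged sketch of PCA + Fisher vector encoding with a GMM background model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Gradients of the GMM log-likelihood w.r.t. means and variances."""
    gamma = gmm.predict_proba(X)                 # (N, K) posteriors
    N = X.shape[0]
    fv = []
    for k in range(gmm.n_components):
        mu, var, w = gmm.means_[k], gmm.covariances_[k], gmm.weights_[k]
        z = (X - mu) / np.sqrt(var)              # standardized residuals
        g = gamma[:, k:k + 1]
        fv.append((g * z).sum(0) / (N * np.sqrt(w)))                 # mean part
        fv.append((g * (z ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)))  # variance part
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))       # signed square root
    return fv / (np.linalg.norm(fv) + 1e-12)     # L2 normalization

# pca = PCA(n_components=X.shape[1] // 2).fit(train_descriptors)
# gmm = GaussianMixture(32, covariance_type='diag').fit(pca.transform(train_descriptors))
# clip_fv = fisher_vector(pca.transform(clip_descriptors), gmm)
```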
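The ranking step can be approximated with scikit-learn's CCA as below; note that the full SLCCA-Rand filter [6] also contrasts the labels against random samples, which this simplified sketch omits.

```python
# Hedged sketch of CCA-based feature ranking with 99% weight-mass selection.
import numpy as np
from sklearn.cross_decomposition import CCA

def slcca_rank(X, y, mass=0.99):
    cca = CCA(n_components=1).fit(X, y.reshape(-1, 1))
    w = np.abs(cca.x_weights_[:, 0])               # |projection weights|
    order = np.argsort(w)[::-1]                    # highest weight first
    cum = np.cumsum(w[order]) / w.sum()            # cumulative weight mass
    return order[:np.searchsorted(cum, mass) + 1]  # selected feature indices
```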
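For the regression stage, a minimal kernel ELM in the formulation of [7] can be sketched as follows; the exponential grid mirrors the search ranges given above.

```python
# Hedged sketch of a kernel ELM regressor: beta solves (K + I/C) beta = y.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

class KernelELM:
    def __init__(self, C=1.0, gamma=1.0):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X_train = X
        K = rbf_kernel(X, X, gamma=self.gamma)
        self.beta = np.linalg.solve(K + np.eye(len(X)) / self.C, y)
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X_train, gamma=self.gamma) @ self.beta

# grid = [10.0 ** e for e in range(-2, 4)]   # exponential steps over [0.01, 1000]
```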
3 SECOND APPROACH

Our second approach used the audio and visual features provided by the organizers, without any dimensionality reduction. The dimensionalities are 1,582 for audio and 1,271 for visual features, respectively. An early fusion of the visual features (except FC6) is fed to Random Forest and support vector regressors (SVR), and hyper-parameters are explored with grid search: for SVR, the cost and gamma parameters range from 0.001 to 100; for Random Forests, the number of trees ranges from 100 to 1000 and the maximum number of features per tree from 3 to 20. Five train/test folds, balanced according to duration and fear labels, are defined so that each movie appears either in the train set or in the test set, but not in both. The best regressors are chosen via grid search and tested on each fold to evaluate performance on a subset of the development set; according to the MSE and PCC scores on each fold, the regressors are trained with the best group. The audio and visual subsystem scores are fused with simple averaging, and the scores for a given movie are smoothed with the Holt-Winters exponentially weighted moving average method [8]. The pipeline is visually presented in Figure 2.

Figure 2: Pipeline without dimensionality reduction.
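The hyper-parameter search of this pipeline can be sketched as follows, assuming scikit-learn; GroupKFold is a stand-in for the movie-disjoint folds and does not reproduce the duration and fear-label balancing described above.

```python
# Hedged sketch of the grid search over SVR and Random Forest regressors.
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

cv = GroupKFold(n_splits=5)  # keeps all clips of a movie on one side of each split
svr = GridSearchCV(SVR(kernel='rbf'),
                   {'C': np.logspace(-3, 2, 6), 'gamma': np.logspace(-3, 2, 6)},
                   cv=cv, scoring='neg_mean_squared_error')
rf = GridSearchCV(RandomForestRegressor(),
                  {'n_estimators': [100, 300, 1000], 'max_features': [3, 10, 20]},
                  cv=cv, scoring='neg_mean_squared_error')
# svr.fit(X_dev, y_dev, groups=movie_ids); rf.fit(X_dev, y_dev, groups=movie_ids)
```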
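The late fusion and smoothing step can be sketched with pandas as below; a simple exponentially weighted moving average stands in for the Holt-Winters smoother [8], and the smoothing factor alpha is a hypothetical choice.

```python
# Hedged sketch of score-level fusion followed by EWMA smoothing per movie.
import pandas as pd

def fuse_and_smooth(audio_pred, visual_pred, alpha=0.3):  # alpha is hypothetical
    fused = (pd.Series(audio_pred) + pd.Series(visual_pred)) / 2.0  # simple average
    return fused.ewm(alpha=alpha).mean().to_numpy()                 # smoothing
```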
4 RESULTS AND ANALYSIS

We submitted five runs for the valence/arousal prediction task. The first run averages the scores of the MFCC, HSH, Dense SIFT, and VGG-FC6 subsystems, and obtains our lowest MSE on the valence task. The second run is a linear weighted combination of the predictions used in the first run. In the third run, the average of MFCC and FC6 is computed for valence, while the average of MFCC, HSH, and FC6 is computed for arousal. In the fourth run, linear combination scores of MFCC, Dense SIFT, and FC6 are computed for valence, and linear combination scores of MFCC, HSH, and FC6 for arousal. For the fifth run, the regression pipelines are selected after grid search, resulting in four separate SVRs with RBF kernels (with the best scoring hyper-parameters from cross-validation); the arousal/valence scores of the test-set data are fused and smoothed to generate the run outputs.

Table 1: Arousal/valence prediction results on the sequestered test set.

Run  Approach                     Arousal            Valence
                                  MSE      PCC       MSE      PCC
1    1, simple avg.               0.1231   0.1289    0.1859   0.0263
2    1, weighted comb.            0.1433   0.0986    0.2249   0.0464
3    1, selective avg.            0.1237   0.1046    0.1889   0.0386
4    1, linear selective comb.    0.1434   0.0990    0.2251   0.0460
5    2, smoothed                  0.1126   0.2186    0.1881   0.0904

Comparing Run 1 and Run 2 in Table 1, combining all features of the first approach by simple averaging is better than combining them with weighted fusion on the arousal task, while the opposite holds for valence in terms of PCC. Comparing Run 1 with Runs 3 and 4 shows that Dense SIFT is important for better arousal prediction in terms of PCC. Run 1 also shows that fusing all features of the first approach by simple averaging gives the best MSE for valence. The best arousal results on both metrics are obtained in Run 5, whose valence PCC is also the best among all runs. Overall, arousal is predicted more accurately than valence.

Our computational resources were limited in terms of time and memory. In future work, we therefore plan to retain more PCA components (for higher explained variance) and to use more GMM clusters for the first four runs. In addition, we plan to employ CCA to extract arousal and valence correlates as mid-level features, so as to optimize the PCC and MSE measures simultaneously.

ACKNOWLEDGMENTS

This work is supported by Boğaziçi University Project BAP 16A01P4 and by the BAGEP Award of the Science Academy.

REFERENCES

[1] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, and M. Sjöberg, "The MediaEval 2017 Emotional Impact of Movies Task," in Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen, "Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos," in Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
[3] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and C. Chamaret, "The MediaEval 2016 Emotional Impact of Movies Task," in Proc. of the MediaEval 2016 Workshop, 2016.
[4] A. Bosch, A. Zisserman, and X. Munoz, "Image classification using random forests and ferns," in Proc. IEEE 11th International Conference on Computer Vision (ICCV), 2007.
[5] F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2007.
[6] H. Kaya, T. Özkaptan, A. A. Salah, and F. Gürgen, "Random discriminative projection based feature selection with application to conflict recognition," IEEE Signal Processing Letters, 22(6), pp. 671-675, 2015.
[7] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2), pp. 513-529, 2012.
[8] P. R. Winters, "Forecasting Sales by Exponentially Weighted Moving Averages," Management Science, 6(3), pp. 324-342, 1960. doi:10.1287/mnsc.6.3.324.