BOUN-NKU in MediaEval 2017 Emotional Impact of Movies Task

Nihan Karslioglu1, Yasemin Timar1, Albert Ali Salah1, Heysem Kaya2
1Boğaziçi University, İstanbul, Turkey, {nihan.karslioglu, yasemin.timar, salah}@boun.edu.tr
2Namık Kemal University, Tekirdağ, Turkey, hkaya@nku.edu.tr

Copyright is held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT

In this paper, we present our approach for the Emotional Impact of Movies task of the MediaEval 2017 Challenge, which involves multimodal fusion for predicting the arousal and valence of movie clips. Our system consists of two pipelines. In the first, we extracted audio and visual features and used a combination of PCA, Fisher vector encoding, feature selection, and extreme learning machine regressors. In the second, we focused on the regression models rather than on feature selection.

1 INTRODUCTION

The challenge we tackle in this paper is the prediction of the affective content of video clips, denoted by valence and arousal scores. We used well-known regression models in the audio-visual domain for this purpose. The feature sets extracted by the organizers were used to form a baseline system, in order to understand the properties and relations of the most important features for this task. The description of the task is provided in [1].

One of the proposed subtasks is the prediction of "fear", which is represented by a binary value in the ground truth. However, the sections annotated with fear are rare (only 5%), which calls for classifiers capable of dealing with class imbalance (e.g., gradient boosting classifiers). We have not worked on this part of the challenge.

The Emotional Impact of Movies task has been included in the MediaEval challenges since 2015. Various approaches have been studied for the problem in terms of features and regression models in recent years [2]. Audio features, visual descriptors, and deep learning based features have been popular among the participants of the 2016 challenge [3].

2 FIRST APPROACH

Our first pipeline, given in Figure 1, extracts a number of features, reduces their dimensionality with PCA, summarizes them with Fisher vector encoding, and applies a feature selection stage prior to regression.

Figure 1: First system for valence-arousal prediction.

As audio features, we computed Mel-frequency cepstral coefficients (MFCC 0-12) from 32 ms windows with 50% overlap. First and second derivatives were added, resulting in a 39-dimensional feature vector.

We used three types of visual features in addition to these audio features. The Hue Saturation Histogram (HSH) feature is a 1023-dimensional histogram of color pixels, with 33 hue and 31 saturation levels. It was sampled at one frame per second, with frames resized to 240x320. For the Dense SIFT feature, the frames were further resized to 120x160, and Dense SIFT descriptors [4] were extracted at scales {4, 6, 8}, at 7-pixel intervals, once every 30 frames of video. Finally, we used the VGG FC6 feature provided by the organizers, extracted from a deep neural network trained for image recognition.

After reducing the dimensionality of the features by 50% via PCA, we encoded them with Fisher vectors (FV) [5], which measure how much the features deviate from a background probability model, in this case a mixture of Gaussians. The number of clusters was set to 32 for Dense SIFT and MFCC, and a single Gaussian was used for HSH and VGG-FC6. We normalized the resulting vectors with the signed square root followed by L2 normalization.

A ranking-based feature selection approach was then applied, using the Random Sample versus Labels Canonical Correlation Analysis Filter (SLCCA-Rand) method [6]. The main idea is to apply CCA between the features and the target labels, and to sort the absolute values of the projection weights to obtain a ranking. The top-ranked features that sum up to 99% of the total weight are selected for each modality.

For regression, Extreme Learning Machines (ELM) were applied to both the arousal and valence prediction tasks [7]. Grid search was used to find the best ELM parameters: the regularization coefficient was searched in the range [0.01, 1000] with exponential steps, and both radial basis function (RBF) and linear kernels were tested, with the RBF kernel scale parameter optimized in the same range, also with exponential steps. The Pearson correlation coefficient (PCC) was taken as the performance measure and optimized over 5-fold cross-validation on the development partition. The results in Table 1 were obtained on the test set, for which the ground truth was sequestered.
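As an illustration of the audio front-end described above, the following is a minimal sketch assuming librosa; the 16 kHz sampling rate is our assumption, while the window size, overlap, and 39-dimensional output follow the text.

```python
# Hedged sketch of the MFCC front-end: 32 ms windows, 50% overlap,
# MFCC 0-12 plus first and second derivatives (39 dims per frame).
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000):   # sampling rate is an assumption
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.032 * sr)              # 32 ms analysis window
    hop = n_fft // 2                     # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)             # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)    # second derivative
    return np.vstack([mfcc, d1, d2]).T           # shape: (frames, 39)
```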
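The encoding stage can be sketched as follows, assuming scikit-learn; the Fisher vector keeps the standard mean and variance gradients of a diagonal-covariance GMM, with the signed square root and L2 normalization applied as described above.

```python
# Hedged sketch of PCA + Fisher vector encoding with a GMM background model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Gradients of the GMM log-likelihood w.r.t. means and variances."""
    gamma = gmm.predict_proba(X)                 # (N, K) posteriors
    N = X.shape[0]
    fv = []
    for k in range(gmm.n_components):
        mu, var, w = gmm.means_[k], gmm.covariances_[k], gmm.weights_[k]
        z = (X - mu) / np.sqrt(var)              # standardized residuals
        g = gamma[:, k:k + 1]
        fv.append((g * z).sum(0) / (N * np.sqrt(w)))                 # mean part
        fv.append((g * (z ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)))  # variance part
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))       # signed square root
    return fv / (np.linalg.norm(fv) + 1e-12)     # L2 normalization

# pca = PCA(n_components=X.shape[1] // 2).fit(train_descriptors)
# gmm = GaussianMixture(32, covariance_type='diag').fit(pca.transform(train_descriptors))
# clip_fv = fisher_vector(pca.transform(clip_descriptors), gmm)
```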
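The ranking step can be approximated with scikit-learn's CCA as below; note that the full SLCCA-Rand filter [6] also contrasts the labels against random samples, which this simplified sketch omits.

```python
# Hedged sketch of CCA-based feature ranking with 99% weight-mass selection.
import numpy as np
from sklearn.cross_decomposition import CCA

def slcca_rank(X, y, mass=0.99):
    cca = CCA(n_components=1).fit(X, y.reshape(-1, 1))
    w = np.abs(cca.x_weights_[:, 0])               # |projection weights|
    order = np.argsort(w)[::-1]                    # highest weight first
    cum = np.cumsum(w[order]) / w.sum()            # cumulative weight mass
    return order[:np.searchsorted(cum, mass) + 1]  # selected feature indices
```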
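For the regression stage, a minimal kernel ELM in the formulation of [7] can be sketched as follows; the exponential grid mirrors the search ranges given above.

```python
# Hedged sketch of a kernel ELM regressor: beta solves (K + I/C) beta = y.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

class KernelELM:
    def __init__(self, C=1.0, gamma=1.0):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X_train = X
        K = rbf_kernel(X, X, gamma=self.gamma)
        self.beta = np.linalg.solve(K + np.eye(len(X)) / self.C, y)
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X_train, gamma=self.gamma) @ self.beta

# grid = [10.0 ** e for e in range(-2, 4)]   # exponential steps over [0.01, 1000]
```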
3 SECOND APPROACH

Our second approach used the audio and visual features provided by the organizers, without any dimensionality reduction. The dimensionalities are 1,582 for audio and 1,271 for visual features, respectively. An early fusion of the visual features (except FC6) is fed to Random Forest and support vector regressors (SVR), and hyper-parameters are explored with grid search: for SVR, the cost and gamma parameters range from 0.001 to 100; for Random Forests, the number of trees ranges from 100 to 1000 and the maximum number of features per tree from 3 to 20. Five train/test folds, balanced according to duration and fear labels, are defined so that each movie appears either in the train set or in the test set, but not in both. The best regressors are chosen via grid search and tested on each fold to evaluate performance on a subset of the development set; according to the MSE and PCC scores on each fold, the regressors are trained with the best group. The audio and visual subsystem scores are fused with simple averaging, and the scores for a given movie are smoothed with the Holt-Winters exponentially weighted moving average method [8]. The pipeline is visually presented in Figure 2.

Figure 2: Pipeline without dimensionality reduction.
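The hyper-parameter search of this pipeline can be sketched as follows, assuming scikit-learn; GroupKFold is a stand-in for the movie-disjoint folds and does not reproduce the duration and fear-label balancing described above.

```python
# Hedged sketch of the grid search over SVR and Random Forest regressors.
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

cv = GroupKFold(n_splits=5)  # keeps all clips of a movie on one side of each split
svr = GridSearchCV(SVR(kernel='rbf'),
                   {'C': np.logspace(-3, 2, 6), 'gamma': np.logspace(-3, 2, 6)},
                   cv=cv, scoring='neg_mean_squared_error')
rf = GridSearchCV(RandomForestRegressor(),
                  {'n_estimators': [100, 300, 1000], 'max_features': [3, 10, 20]},
                  cv=cv, scoring='neg_mean_squared_error')
# svr.fit(X_dev, y_dev, groups=movie_ids); rf.fit(X_dev, y_dev, groups=movie_ids)
```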
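The late fusion and smoothing step can be sketched with pandas as below; a simple exponentially weighted moving average stands in for the Holt-Winters smoother [8], and the smoothing factor alpha is a hypothetical choice.

```python
# Hedged sketch of score-level fusion followed by EWMA smoothing per movie.
import pandas as pd

def fuse_and_smooth(audio_pred, visual_pred, alpha=0.3):  # alpha is hypothetical
    fused = (pd.Series(audio_pred) + pd.Series(visual_pred)) / 2.0  # simple average
    return fused.ewm(alpha=alpha).mean().to_numpy()                 # smoothing
```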
4 RESULTS AND ANALYSIS

We submitted five runs for the valence/arousal prediction task. The first run averages the scores of the MFCC, HSH, Dense SIFT, and VGG-FC6 subsystems, and obtains our lowest MSE on the valence task. The second run is a linear weighted combination of the predictions used in the first run. In the third run, the average of MFCC and FC6 is computed for valence, while the average of MFCC, HSH, and FC6 is computed for arousal. In the fourth run, linear combination scores of MFCC, Dense SIFT, and FC6 are computed for valence, and linear combination scores of MFCC, HSH, and FC6 for arousal. For the fifth run, the regression pipelines are selected after grid search, resulting in four separate SVRs with RBF kernels (with the best scoring hyper-parameters from cross-validation); the arousal/valence scores of the test-set data are fused and smoothed to generate the run outputs.

Table 1: Arousal/valence prediction results on the sequestered test set.

Run  Approach                     Arousal            Valence
                                  MSE      PCC       MSE      PCC
1    1, simple avg.               0.1231   0.1289    0.1859   0.0263
2    1, weighted comb.            0.1433   0.0986    0.2249   0.0464
3    1, selective avg.            0.1237   0.1046    0.1889   0.0386
4    1, linear selective comb.    0.1434   0.0990    0.2251   0.0460
5    2, smoothed                  0.1126   0.2186    0.1881   0.0904

Comparing Run 1 and Run 2 in Table 1, combining all features of the first approach by simple averaging is better than combining them with weighted fusion on the arousal task, while the opposite holds for valence in terms of PCC. Comparing Run 1 with Runs 3 and 4 shows that Dense SIFT is important for better arousal prediction in terms of PCC. Run 1 also shows that fusing all features of the first approach by simple averaging gives the best MSE for valence. The best arousal results on both metrics are obtained in Run 5, whose valence PCC is also the best among all runs. Overall, arousal is predicted more accurately than valence.

Our computational resources were limited in terms of time and memory. In future work, we therefore plan to retain more PCA components (for higher explained variance) and to use more GMM clusters for the first four runs. In addition, we plan to employ CCA to extract arousal and valence correlates as mid-level features, so as to optimize the PCC and MSE measures simultaneously.

ACKNOWLEDGMENTS

This work is supported by Boğaziçi University Project BAP 16A01P4 and by the BAGEP Award of the Science Academy.

REFERENCES

[1] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, and M. Sjöberg, "The MediaEval 2017 Emotional Impact of Movies Task," in Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.
[2] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen, "Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos," in Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
[3] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and C. Chamaret, "The MediaEval 2016 Emotional Impact of Movies Task," in Proc. of the MediaEval 2016 Workshop, 2016.
[4] A. Bosch, A. Zisserman, and X. Munoz, "Image classification using random forests and ferns," in Proc. IEEE 11th International Conference on Computer Vision (ICCV), 2007.
[5] F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2007.
[6] H. Kaya, T. Özkaptan, A. A. Salah, and F. Gürgen, "Random discriminative projection based feature selection with application to conflict recognition," IEEE Signal Processing Letters, 22(6), pp. 671-675, 2015.
[7] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2), pp. 513-529, 2012.
[8] P. R. Winters, "Forecasting Sales by Exponentially Weighted Moving Averages," Management Science, 6(3), pp. 324-342, 1960. doi:10.1287/mnsc.6.3.324.