GLA in MediaEval 2018 Emotional Impact of Movies Task

Jennifer J. Sun¹, Ting Liu, Gautam Prasad
Google LLC
jjsun@caltech.edu, {liuti,gautamprasad}@google.com

¹ This work was completed during Jennifer's internship at Google.

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
We present our methods for the MediaEval 2018 Emotional Impact of Movies Task, predicting the expected valence and arousal continuously in movies. Our approach leverages image, audio, and face based features computed using pre-trained neural networks. These features were computed over time and modeled using a gated recurrent unit (GRU) based network followed by a mixture of experts model to compute multiclass predictions. We smoothed these predictions using a Butterworth filter for our final result.

1 INTRODUCTION
The Emotional Impact of Movies Task [7], part of the MediaEval 2018 benchmark, provides participants with a common dataset for predicting the expected emotional impact of videos. We focused on the first subtask in the challenge: predicting the expected valence and arousal continuously (every second) in movies. The dataset provided by the task is the LIRIS-ACCEDE dataset [3, 4], which is annotated with self-reported valence and arousal every second from multiple annotators. Since deep neural networks such as Inception [19] have millions of trainable parameters, the competition data may be too limited to train these networks from random initializations. Therefore, we used networks pre-trained on larger datasets, such as ImageNet, to extract features from the LIRIS-ACCEDE dataset. The extracted features were used to train our temporal and regression models.

This task is recurring, with multiple submissions every year [11, 14]. Our method's novelty lies in the unique set of features we extracted, including image, audio, and face features (capitalizing on transfer learning), along with our model setup, which combines a GRU with a mixture of experts.

2 APPROACH
We approached valence and arousal prediction as a multivariate regression problem. Our objective is to minimize a multi-label sigmoid cross-entropy loss, which could allow the model to exploit potential relationships between the two dimensions. We first used pre-trained networks to extract features. To model the temporal aspects of the data, we evaluated long short-term memory (LSTM) networks, gated recurrent units (GRUs), and temporal convolutional networks (TCNs). Multiple modalities were fused with late fusion, and valence and arousal were predicted jointly. Our method is implemented in TensorFlow, and we used the Adam optimizer [12] in all our experiments.

2.1 Feature Extraction
We extracted image, audio, and face features from each frame of the movies. Our image features (Inception-Image) were from the Inception network [19] pre-trained on ImageNet [16]. We extracted audio features using AudioSet [9], which is a VGG-inspired model pre-trained on YouTube-8M [1]. For the face features (Inception-Face), we focused on the two largest faces in each frame and used an Inception based architecture trained on faces [17]. Since the movies were human-focused, faces were found in most of the scenes. We compared these features with those used in last year's competition, which included image features computed using VGG16 [18] and audio features computed using openSMILE [8]. All our features were extracted at one frame per second.
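For concreteness, a minimal sketch of frame-level image feature extraction along the lines described above is shown below. The specific backbone (Keras InceptionV3 with global average pooling), input resolution, and batch size are illustrative assumptions rather than the exact checkpoints and preprocessing used for our Inception-Image features; the audio and face features would be extracted analogously with their respective pre-trained networks.

```python
# Sketch of frame-level image feature extraction (assumptions: Keras
# InceptionV3 as a stand-in for the Inception-Image features; frames are
# pre-decoded at 1 fps into a numpy array).
import numpy as np
import tensorflow as tf

# ImageNet-pretrained Inception backbone; global average pooling -> 2048-d.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")

def image_features(frames_1fps: np.ndarray) -> np.ndarray:
    """frames_1fps: [num_seconds, 299, 299, 3] uint8 frames, one per second."""
    x = tf.keras.applications.inception_v3.preprocess_input(
        frames_1fps.astype(np.float32))
    return backbone.predict(x, batch_size=32)  # [num_seconds, 2048]
```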
2.2 Temporal Models
To model the temporal dynamics of emotion in the videos, we used recurrent neural networks. In particular, we used LSTMs [10] and GRUs [6] as part of our modeling pipeline in a sequence-to-one setup with sequence lengths of 10, 30, or 60 seconds. The self-reported emotions likely depend on past scenes in the movies, so temporal modeling is important for this task. In addition, we evaluated TCNs because of their promising performance in sequence modeling [2] and action segmentation [13]. Specifically, we trained an encoder-decoder TCN on sequences of extracted features to obtain a sequence of valence and arousal predictions.

2.3 Regression Models
The input from each modality (image, audio, or face) is fed into a separate recurrent model. The output we use from each recurrent model is its hidden state, which contains information on the previous data seen by the model; we take the hidden state corresponding to the final timestamp of the input sequence. The state vectors for the modalities are concatenated into a single vector and fed into a context gate [15]. Multimodal fusion occurs at this stage, as the learned gate fuses the representations of each modality produced by the RNNs. The output of the context gate is then fed into a mixture-of-experts model with another context gate to obtain the final emotion predictions. We use logistic regression experts with a softmax gating network.

To prevent overfitting, we regularized our models using L2 regularization, dropout, and batch normalization. Finally, a low-pass filter is applied to the predictions to smooth the outputs. In LIRIS-ACCEDE, the measured emotion data vary smoothly in time, but our raw regression outputs contain high-frequency components. To smooth them, we tested weighted moving average filters and low-pass Butterworth filters [5] as implemented in SciPy.
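The sketch below illustrates the regression pipeline of Sections 2.2 and 2.3: per-modality sequence-to-one GRUs, concatenation of the final hidden states, a context gate, and a mixture of logistic-regression experts with a softmax gating network followed by a second context gate. The feature dimensions, layer sizes, number of experts, and exact gating formulation are assumptions for illustration and do not reproduce the submitted configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN = 60          # input sequence length in seconds
NUM_EXPERTS = 4       # illustrative; not the submitted setting
NUM_OUTPUTS = 2       # valence and arousal

def context_gate(x):
    # Context Gating [15]: elementwise y = sigmoid(W x + b) * x.
    gate = layers.Dense(x.shape[-1], activation="sigmoid")(x)
    return layers.Multiply()([x, gate])

def build_model(feature_dims={"image": 2048, "audio": 128, "face": 1024},
                gru_units=128):
    inputs, states = [], []
    for name, dim in feature_dims.items():
        inp = layers.Input(shape=(SEQ_LEN, dim), name=name)
        # Sequence-to-one: keep only the final GRU hidden state per modality.
        states.append(layers.GRU(gru_units, name="gru_" + name)(inp))
        inputs.append(inp)

    # Late fusion of the per-modality states through a context gate.
    fused = context_gate(layers.Concatenate()(states))

    # Mixture of experts: logistic-regression experts with a softmax gate,
    # one mixture per output dimension.
    experts = layers.Dense(NUM_EXPERTS * NUM_OUTPUTS, activation="sigmoid")(fused)
    experts = layers.Reshape((NUM_OUTPUTS, NUM_EXPERTS))(experts)
    gate_logits = layers.Dense(NUM_EXPERTS * NUM_OUTPUTS)(fused)
    gates = layers.Softmax(axis=-1)(
        layers.Reshape((NUM_OUTPUTS, NUM_EXPERTS))(gate_logits))
    preds = layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=-1))([experts, gates])

    # Second context gate on the expert output, as described in Section 2.3.
    preds = context_gate(preds)

    model = tf.keras.Model(inputs, preds)
    # Sigmoid cross-entropy objective, assuming labels rescaled to [0, 1].
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The binary cross-entropy loss here stands in for the multi-label sigmoid cross-entropy objective of Section 2, under the assumption that the valence and arousal labels are rescaled to [0, 1].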
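For the smoothing step, the following is a minimal example of low-pass Butterworth filtering with SciPy as described above; the filter order and cutoff frequency are illustrative, not the values used for our submissions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_predictions(preds: np.ndarray, cutoff_hz=0.02, fs=1.0, order=2):
    """Zero-phase low-pass Butterworth smoothing of per-second predictions.

    preds: [num_seconds, 2] raw valence/arousal outputs sampled at 1 Hz.
    cutoff_hz and order are illustrative, not the submitted settings.
    """
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, preds, axis=0)
```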
3 RESULTS AND ANALYSIS
We optimized the hyperparameters of our models for the best performance on the validation set, which consists of 13 movies from the development set. We then trained our models on the entire development set before running inference on the test set. Our setup used a batch size of 512.

In evaluating our recurrent models, we found that the Inception-Image+AudioSet features outperformed the VGG16+openSMILE features in terms of both mean squared error (MSE) and Pearson correlation coefficient (PCC). In some cases the recurrent model using VGG16+openSMILE would predict near the mean for both valence and arousal, possibly because those features did not carry enough information for the model to discriminate between different values of valence and arousal. We also found a significant increase in performance when we added the Inception-Face features, which may point to salient information captured in connection with the expected emotions in the videos.

The sequence-to-one recurrent models worked best with longer input sequences of 60 seconds rather than 10 or 30 seconds, perhaps because the evoked emotion is affected by longer-lasting scenes. Our recurrent models also performed better on the validation set than the TCN, and the GRU models had performance similar to the LSTMs. We used GRUs for our implementation because they are computationally simpler than LSTMs; since we have a small dataset, we wanted to reduce model complexity to prevent overfitting. Our temporal model architectures ranged from 32 to 256 units and 1 to 2 layers, optimized for each of the modalities. For post-processing, the low-pass Butterworth filter worked better than the moving average filter, likely because the Butterworth filter is designed to have a frequency response that is as flat as possible (with no ripples) in the pass-band. Fluctuations of the magnitude response within the pass-band may decrease the accuracy of our regression output.

In Table 1 we list the performance of the five models we submitted to the task. All runs used Inception-Image, AudioSet, and Inception-Face features with a GRU and mixture-of-experts regression, and are defined as follows.
(1) No dropout or batch normalization.
(2) Regularized with dropout and batch normalization. Trained on approximately 70% of the data.
(3) Regularized with dropout and batch normalization.
(4) Regularized with dropout and batch normalization, with a different initialization and number of epochs.
(5) Average over all runs.

Table 1: Performance of our five models.

           Valence            Arousal
           MSE      PCC       MSE      PCC
Run 1      0.1193   0.1175    0.1384   0.2911
Run 2      0.0945   0.1376    0.1479   0.1957
Run 3      0.1133   0.1883    0.1778   0.2773
Run 4      0.1073   0.2779    0.1396   0.3513
Run 5      0.0837   0.1786    0.1334   0.3358

We see that creating an ensemble by averaging over the runs gives the lowest MSE (Run 5). This is likely because averaging decreases the variance of the predictions, so that overall the mean is closer to the ground-truth labels. While averaging improves MSE, it does not improve correlation; our model with the best correlation is from Run 4.

We note that using batch normalization during inference increases the variance of our predictions, because the batches are then normalized with batch statistics instead of the population statistics from the training set. Our validation results (with repeated runs) as well as our test results show that using batch normalization in this way improves predictions for valence, but not as much for arousal. This is most likely because the test-set statistics for valence differ from those of the training set, while the test-set statistics for arousal may be closer to the training-set statistics. One explanation could be the small size of the dataset, so that the statistics of the training set do not generalize well to the test set.
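In Keras terms, the batch-statistics behaviour analysed above corresponds to calling the model in training mode at inference time. The snippet below is a sketch under the assumption of a Keras model such as the one outlined after Section 2.3, extended with BatchNormalization layers, with features holding a dict of input arrays.

```python
# Standard inference: BatchNormalization uses the moving (population)
# statistics accumulated during training.
preds_population = model(features, training=False)

# Inference with batch statistics, as analysed above: training=True makes
# BatchNormalization normalize each batch with its own mean and variance.
# (Caveat: this flag also re-enables dropout, so in practice it would be
# passed to the normalization layers individually.)
preds_batch_stats = model(features, training=True)
```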
This arXiv:1803.01271 (2018). Emotional Impact of Movies Task MediaEval’18, 29-31 October 2018, Sophia Antipolis, France [3] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, [19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon and Liming Chen. 2015. Deep learning vs. kernel methods: Shlens, and Zbigniew Wojna. 2016. Rethinking the incep- Performance for emotion prediction in videos. In Affective tion architecture for computer vision. In Proceedings of the Computing and Intelligent Interaction (ACII), 2015 Interna- IEEE conference on computer vision and pattern recognition. tional Conference on. IEEE, 77–83. 2818–2826. [4] Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. 2015. Liris-accede: A video database for affective content analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55. [5] Stephen Butterworth. 1930. On the theory of filter amplifiers. Wireless Engineer 7, 6 (1930), 536–541. [6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations us- ing RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014). [7] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Viktor Sjöberg. 2018. The mediaeval 2018 emotional impact of movies task. In MediaEval 2018 Multimedia Benchmark Workshop Working Notes Proceed- ings of the MediaEval 2018 Workshop. [8] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 835–838. [9] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human- labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Con- ference on. IEEE, 776–780. [10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short- term memory. Neural computation 9, 8 (1997), 1735–1780. [11] Zitong Jin, Yuqi Yao, Ye Ma, and Mingxing Xu. 2017. THUHCSI in MediaEval 2017 Emotional Impact of Movies Task. Proc. MediaEval (2017). [12] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). [13] Colin Lea, Rene Vidal, Austin Reiter, and Gregory D Hager. 2016. Temporal convolutional networks: A unified approach to action segmentation. In European Conference on Computer Vision. Springer, 47–54. [14] Yang Liu, Zhonglei Gu, and Tobey H Ko. 2017. HKBU at MediaEval 2017 Emotional Impact of Movies Task. In Mediaeval 2017 Workshop. Dublin, Ireland. [15] Antoine Miech, Ivan Laptev, and Josef Sivic. 2017. Learnable pooling with Context Gating for video classification. arXiv preprint arXiv:1706.06905 (2017). [16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San- jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252. [17] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823. [18] Karen Simonyan and Andrew Zisserman. 2014. 
Very deep con- volutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).