GIBIS at MediaEval 2018: Predicting Media Memorability Task

Ricardo Manhães Savii¹,², Samuel Felipe dos Santos¹, and Jurandy Almeida¹
¹ GIBIS Lab, Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo, Brazil
² Dafiti Group, Brazil
ricardo.savii@dafiti.com.br, {felipe.samuel,jurandy.almeida}@unifesp.br

Copyright held by the owner/author(s). MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT

This paper describes the GIBIS team experience in the Predicting Media Memorability Task at MediaEval 2018. In this task, we were required to develop an approach to predict a score reflecting whether videos are memorable or not, considering short-term memorability and long-term memorability. Our proposal relies on different learning strategies: for long-term memorability, we adopted k-NN regressors trained on hand-crafted motion features; and for short-term memorability, we trained deep learning models.

KEYWORDS

Media Memorability, k-NN Regressor, Deep Learning

1 INTRODUCTION

The Predicting Media Memorability task is part of the MediaEval 2018 Benchmarking Initiative for Multimedia Evaluation. The goal of this task is to automatically predict a memorability score for a video reflecting its probability of being remembered. For this purpose, a dataset of 10,000 short, soundless videos is provided, split into 8,000 videos for the development set and 2,000 videos for the test set. Pre-computed visual features are also provided to facilitate participation. For more details about this task, please refer to [2].

In this paper, we explore two main approaches: (1) for long-term memorability, an ensemble of ten KNR (k-Nearest Neighbor Regressor) or SVR (Support Vector Regression) [4] models trained on the provided HMP (Histogram of Motion Patterns) [1] feature; and (2) for short-term memorability, a deep learning model based on 3D convolutions and 3D pooling layers, known as C3D (Convolution3D) [5].

2 OUR APPROACH

The proposed approach exploits different strategies for long-term and short-term memorability. The former relies on hand-crafted motion features extracted with HMP whereas the latter uses data-driven features learned with C3D. One limitation of C3D is its limited capacity to capture subtle but long-term motion dynamics, as it requires breaking a video into small clips. Unlike C3D, HMP captures the motion dynamics of a video as a whole, and not just parts.

2.1 Long-Term Approach

Our proposed approach for the long-term memorability subtask consists of using the pre-computed HMP feature [1] in conjunction with two regression algorithms: SVR (Support Vector Regression) and KNR (k-Nearest Neighbor Regressor).

HMP encodes an entire video into a single histogram representing its overall motion dynamics. From this, we can consider the HMP vector as a hash identifying each video as a point in a high-dimensional space. This idea of space is the foundation for the use of the KNR and SVR algorithms [4].

For reproducibility, the KNR and SVR implementations used come from the scikit-learn Python package [4]. The steps of the experiment are: split the development set at random into ten folds, then train one regression model on each fold. In this way, we get ten different models. They are used as an ensemble to predict the memorability over the HMP features of the test set. The average output is taken as the final score and the 95% confidence interval is used as the output confidence.

2.2 Short-Term Approach

For short-term memorability, we use a deep learning model based on the C3D architecture [5]. However, our C3D model differs from the original C3D model in some respects. Here, we include a multi-headed layer by adding two fully connected layers at the top of the C3D model. To provide confidence over prediction values, we implemented a multi-output, two-headed model. The heads are: (1) a regression head (i.e., sigmoid activation) used to predict the memorability score; and (2) a classification head predicting the discretized memorability bucket. The short-term memorability score was discretized into 10 buckets, which were used as classes for prediction. In this way, the classification head, using a softmax activation, provides a confidence value over the responses of the regression head.

The implemented C3D model follows a 3D convolution and 3D max pooling architecture¹ and it outputs a fully connected layer with 2048 neurons. This is the first of three fully connected layers that feed a multi-head output for regression and classification. Figure 1 shows the network architecture of our first experiment.

¹ Our C3D implementation is available at: https://github.com/ricoms/deep_memorability/blob/master/deep_memorability/trainer2/video_c3d.py

Figure 1: One-stream C3D architecture (adapted from [5]).

In our second experiment, we used the same C3D model but in a two-stream network. For this, a second C3D model receives the optical flow [3] as input. The outputs of the two C3D models are concatenated to form the first fully connected layer. The motivation for this two-stream network is to evaluate whether our C3D model can improve its results with this extra information. For reproducibility, we used the dense version of optical flow provided with the OpenCV library². Figure 2 shows the overall architecture for this experiment.

² https://docs.opencv.org/3.4/d7/d8b/tutorial_py_lucas_kanade.html

Figure 2: Two-stream C3D architecture (adapted from [5]).

For both networks, the input data are normalized to real values in the range [−1, 1] and resized to 128 × 171 pixels. Also, the C3D model limits the input to a frame sequence with a predefined length (typically, 16 frames) and, for this reason, a sequence of 16 consecutive frames from each video was selected at random and used as input to the network. Optical flow generates a frame sequence with one frame fewer, so, to simplify the implementation, a last frame filled with zeros was appended at the end.

For training, a different loss function was used for each head: mean squared logarithmic error for the regression head and categorical crossentropy for the classification head. Then, a weighted sum of these individual losses, with weights 1.0 and 0.7, respectively, was computed as the final loss to be minimized by an RMSProp optimizer with a learning rate of 0.0015.

3 RESULTS AND ANALYSIS

We submitted three runs, configured as shown in Table 1. We calibrated the long-term memorability subtask through 10-fold cross-validation on the development data and used a holdout method with 10% of the development data for validation to calibrate the short-term memorability subtask. The evaluation metrics are: Spearman’s rank correlation, Pearson correlation coefficient, and MSE (Mean Squared Error). The first is the official metric for the task.

Table 1: Configuration of the submitted runs.

Subtask                   Run   Configuration
Long-term memorability    1     HMP + 20-NN regressor
Short-term memorability   1     1-C3D (video)
Short-term memorability   2     2-C3D (video, optical flow)

Table 2 presents the results for the development and test sets considering the long-term memorability subtask. On the development set, we tested different regression models: SVR with RBF kernel and KNR. The values tested for the parameter k of KNR were 5, 20, and 30. Notice that KNR performs better than SVR for the Spearman and Pearson metrics and that the best results were achieved by KNR with k = 20. Therefore, we submitted one run for the long-term memorability subtask based on our best result on the development set. With it, we achieved a Spearman value of 0.11845 on the test set.

Table 2: Long-term memorability results.

Set        Approach                 Spearman   Pearson    MSE
Dev. Set   HMP + SVR (RBF kernel)   −0.026     −0.009     0.02
Dev. Set   HMP + 5-NN regressor     −0.004     −0.002     0.02
Dev. Set   HMP + 20-NN regressor    0.009      0.022      0.02
Dev. Set   HMP + 30-NN regressor    −0.003     0.014      0.02
Test Set   HMP + 20-NN regressor    0.11845    0.11966    0.02011

Table 3 presents the results for the development and test sets considering the short-term memorability subtask. The results for the Spearman and Pearson metrics are not a number (NaN) and therefore they were not included in this table. The reason is that the C3D models assigned the same memorability score to all the videos. This made it impossible to calculate the Spearman and Pearson correlation metrics due to lack of variance. Although we tried several adjustments to hyper-parameters and pre-processing, it turned out during the experiments that the models did not improve after reaching a value close to the average memorability score for the development set, possibly indicating a lack of fit.

Table 3: Short-term memorability results.

Approach                      MSE (Dev. Set)   MSE (Test Set)
1-C3D (video)                 0.0043           0.00702
2-C3D (video, optical flow)   0.0046           0.00699

4 DISCUSSION AND OUTLOOK

HMP and C3D differ in an important way: HMP captures the motion dynamics of a video as a whole whereas C3D is limited to a short window of fixed duration. In future work, we intend to analyze whether features encoding long-term motion dynamics, like HMP or RNN (Recurrent Neural Network), are better for predicting video memorability than those capturing short-term motion dynamics, like C3D or ORB.

We can think of some reasons for the failure of our deep learning models. First, we constrain a full-length video to a sequence of 16 consecutive frames. Smarter strategies to capture the temporal structure of a video, like RNN with LSTM (Long Short-Term Memory), could lead to improvements. Second, we trained our deep neural networks from scratch. As the training set is rather small, data augmentation could be used to improve the results.

Another promising direction is to combine different features. For short-term memorability, we fused optical flow and video data. Would it improve results if we fused video (visual data) and captions (textual data) provided for the task? Or other visual features, like HMP?

ACKNOWLEDGMENTS

We thank the São Paulo Research Foundation - FAPESP (grant 2016/06441-7), the Brazilian National Council for Scientific and Technological Development - CNPq (grants 423228/2016-1 and 313122/2017-2) and the Brazilian Federal Agency for Coordination for the Improvement of Higher Education Personnel - CAPES (grant 1703269) for funding. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

REFERENCES

[1] J. Almeida, N. J. Leite, and R. S. Torres. 2011. Comparison of Video Sequences with Histograms of Motion Patterns. In IEEE International Conference on Image Processing (ICIP’11). Brussels, Belgium, 3673–3676.
[2] R. Cohendet, C-H. Demarty, N. Duong, M. Sjöberg, B. Ionescu, T-T. Do, and F. Rennes. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proc. of the MediaEval 2018 Workshop. Sophia Antipolis, France.
[3] L. Fan, W-B. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang. 2018. End-to-End Learning of Motion Representation for Video Understanding. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR’18). Salt Lake City, UT, USA, 6016–6025.
[4] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[5] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In IEEE International Conference on Computer Vision (ICCV’15). Santiago, Chile, 4489–4497.
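
As an illustration of the ensemble procedure described in Section 2.1, the following minimal sketch trains one k-NN regressor per fold and averages the ten predictions. It is not the submitted code. The random arrays stand in for the HMP vectors and memorability scores, the function name ensemble_predict is hypothetical, "one model on each fold" is read literally (each model is fitted on a single fold), and the normal-approximation formula for the 95% interval is an assumption, since the paper does not state how the interval was computed.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor


def ensemble_predict(X_dev, y_dev, X_test, k=20, n_folds=10, seed=0):
    """Train one k-NN regressor per fold and average their test predictions."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    per_model = []
    # KFold yields (rest, fold) index pairs; here each model is fitted on
    # its own fold only (a literal reading of the text, i.e. an assumption).
    for _, fold_idx in folds.split(X_dev):
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_dev[fold_idx], y_dev[fold_idx])
        per_model.append(model.predict(X_test))
    per_model = np.stack(per_model)  # shape: (n_folds, n_test)
    score = per_model.mean(axis=0)
    # Assumption: half-width of a normal-approximation 95% interval of the mean.
    half_width = 1.96 * per_model.std(axis=0, ddof=1) / np.sqrt(n_folds)
    return score, half_width


# Synthetic stand-ins for the HMP vectors and the memorability scores.
rng = np.random.default_rng(0)
X_dev = rng.random((400, 32))
y_dev = rng.uniform(0.3, 1.0, size=400)
X_test = rng.random((50, 32))

score, half_width = ensemble_predict(X_dev, y_dev, X_test)
print(score[:3], half_width[:3])
```

Fitting each model on the other nine folds instead would be an equally plausible reading; only the index passed to fit would change.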
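
The two-headed training objective of Section 2.2 (mean squared logarithmic error for the regression head, categorical crossentropy for the classification head, combined with weights 1.0 and 0.7) can be written out in a few lines of NumPy. This is only a sketch of the arithmetic, not the network itself. The equal-width bucketing in to_bucket and the use of the largest softmax probability as the confidence value are assumptions, and all function names are hypothetical.

```python
import numpy as np

N_BUCKETS = 10
W_REG, W_CLS = 1.0, 0.7  # loss weights stated in Section 2.2


def to_bucket(score, n_buckets=N_BUCKETS):
    """Map scores in [0, 1] to bucket indices (equal-width bins: an assumption)."""
    idx = (np.asarray(score) * n_buckets).astype(int)
    return np.minimum(idx, n_buckets - 1)


def msle(y_true, y_pred):
    """Mean squared logarithmic error."""
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)


def categorical_crossentropy(bucket_idx, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true bucket."""
    picked = probs[np.arange(len(bucket_idx)), bucket_idx]
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))


def total_loss(y_true, y_pred, probs):
    """Weighted sum of the two head losses."""
    return (W_REG * msle(y_true, y_pred)
            + W_CLS * categorical_crossentropy(to_bucket(y_true), probs))


y_true = np.array([0.95, 0.72, 0.88])             # hypothetical target scores
y_pred = np.array([0.90, 0.80, 0.85])             # sigmoid outputs, regression head
probs = np.full((3, N_BUCKETS), 1.0 / N_BUCKETS)  # uniform softmax outputs
loss = total_loss(y_true, y_pred, probs)
confidence = probs.max(axis=1)  # one possible confidence readout (assumption)
print(loss, confidence)
```

With uniform softmax outputs the classification term equals ln(10), so the example makes the relative contribution of the two heads easy to inspect.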
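
The NaN correlations reported for the short-term runs in Section 3 follow directly from the definition of the Pearson coefficient: its denominator is the product of the two standard deviations, which is zero when every prediction is identical. The small sketch below, with hypothetical scores and a hand-written pearson helper, shows that the correlation is undefined while the MSE remains computable. Spearman's coefficient applies the same formula to ranks, and a constant vector has all ranks tied, so it is undefined for the same reason.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant (zero variance)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.max() == x.min() or y.max() == y.min():
        return float("nan")  # the denominator would be zero
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


y_true = np.array([0.95, 0.72, 0.88, 0.81])    # hypothetical ground-truth scores
y_const = np.full_like(y_true, y_true.mean())  # model that always outputs the mean

r_const = pearson(y_true, y_const)                   # undefined
mse_const = float(np.mean((y_true - y_const) ** 2))  # still well defined
print(r_const, mse_const)
```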