Predicting Media Memorability Using Deep Features and Recurrent Network Duy-Tue Tran-Van, Le-Vu Tran, Minh-Triet Tran University of Science, Vietnam National University-Ho Chi Minh City tvdtue@apcs.vn,tlvu@apcs.vn,tmtriet@fit.hcmus.edu.vn ABSTRACT LSTM based approach in VM has only been tried in [3]. However, In the Predicting Media Memorability Task at the MediaEval Chal- the results did not seem promising because of their small dataset. lenge 2018, our team proposes an approach that uses deep visual features and recurrent network to predict videos’ memorability. Fea- 3 MEMORABILITY PREDICTING tures are extracted from CNN for a number of frames in each video. Feature extraction: In order to resolve the temporal factor, instead We forward these through a LSTM network to model the struc- of using C3D [11], we decide to break the video into multiple frames ture of the video and predict its memorability score. Our method and treat those frames as a batch representing that video. At the achieves a correlation score of 0.484 on short-term task and 0.257 beginning, we extract only 3 frames (the beginning, middle, and on long-term task in the official test set. last frames) for processing. After several tests we figure out that we can achieve higher results with more frames extracted. However, 1 INTRODUCTION we end up with the decision of using 8 frames rather than a greater number. Indeed, the correlation was not substantially better and The Predicting Media Memorability task’s main objective is to we want a straightforward extracting process. The length of each automatically predict a score which indicates how memorable a video in the dataset is 7 seconds. We get the very first frame of the video will be [2]. Video’s memorability can be affected by several video, then after each second, one more frame is captured, so finally factors such as: semantics, color feature, saliency, etc. for each video we have 8 frames. In this paper, we examine the sequential structure of videos with We decide to use pre-trained Inception-v3 Convolutional Neural LSTM. We take advantage of deep convolutional neural networks Network [10] to extract the frames’ features as we want a concise to get image features as our main source of data for predicting network which can conduct a reasonably high accuracy. We use the video memorability. In our approach, there are two main stages: publicly available model pretrained on ImageNet [7] and extract the (i) extract image features through multiple frames of a video, (ii) output with a dimensionality of 2048 from the last fully connected predict its memorability score. layer with average pooling. In the first stage, we sample 8 frames from each video. These Predicting memorability: We consider several approaches re- frames are then fed into a pretrained Inception-v3 convolution net- garding image and video memorability. In our attempts at adapting work [10] to extract their 2048-dimension features. Once extracted, IM to VM, we simply use only the middle frame of each video and each of the video frames sequentially becomes an input of a re- train two models with them as input data. We implemented a simple current neural network with one LSTM layer in the second stage. model which consists of a CNN for feature extraction and 2 fully The memorability score corresponds to the output of the last dense connected (FC) layers for computing output score. We also retrain layer for the last sequence’s input, i.e., the video’s final frame. the model in [4] with those images to see if their model generalizes well to the task’s dataset. We did not have enough time to try the 2 RELATED WORK approach in [9]. The task of predicting image memorability (IM) has made significant Videos’ captioning features are also tested by using provided progress since the release of MIT’s large-scale image memorability captions from the dataset. These captions accurately represent dataset and their MemNet [1]. Recently, in 2018, Fajtl et. al. [4] pro- the videos in terms of semantics. Moreover, all videos are short posed a method, which benefits from deep learning, visual attention, and mostly single scene. Therefore, we use only 1 caption per and recurrent networks, and achieved nearly human consistency video instead of generating each for every frame. A vector of 300 level in predicting memorability on this dataset. In [9], the authors’ dimensions is extracted from each video’s caption, which has been deep learning approach has even surpassed human consistency preprocessed, using the pretrained word2vec model [6]. We feed level with ρ = 0.72. these vectors into our caption-only LSTM and the obtained results On the other hand, several attempts have been made in annotat- are shown in Table 1. ing and predicting video memorability (VM) [3, 5, 8]. In the latter Furthermore, we propose to use a LSTM model to predict VM two methods, their results both agree that video semantics, from score using features extracted above (figure 1). Each extracted fea- captioning features in particular, give the best performance overall. ture vector of every frame of a video is an input of a time step in our In our work, we explore the effect of videos’ sequential aspect on LSTM model. At the last step, a dense layer takes a 1024-dimension memorability by using LSTM on visual features. To our knowledge, output vector of LSTM model and calculates the memorability score of that video. Copyright held by the owner/author(s). MediaEval’18, 29-31 October 2018, Sophia Antipolis, France For the short-term task, three out of five submitted runs are the results of our proposed method with three different configurations MediaEval’18, 29-31 October 2018, Sophia Antipolis, France Duy-Tue, Le-Vu, Minh-Triet and ρ = 0.24 − 0.26 on the validation subset of the short-term and long-term tasks respectively. Discussion: The dataset from [2] on short-term memorability does follow a common trend previously stated in [1]. Videos with contents of natural scenes, landscapes, backgrounds, and exteriors Figure 1: The proposed method for predicting memorability tend to be less memorable. On the other hand, videos with scenes scores of videos using deep features and LSTM. that have people, interiors, and human-made objects are easily remembered. (512, 1024, and 2048 hidden units). The remaining two are outputs of the retrained AMNet in [4] because we also want to test its performance on the task’s dataset. For the long-term task, the first run stands for our proposed method trained with long-term labels. The second run is accomplished by training our model using short- term labels and validating it by long-term labels. The next run is the result of retraining AMNet. Two final runs are actually the predicted results of two previous checkpoints in short-term task. 4 RESULTS AND DISCUSSION Figure 2: Predicted results from our models for long-term In this section, we evaluate our LSTM model on the task’s dataset. task (top) and short-term task (bottom). The images are We present our quantitative results as well as some insight that we sorted from the most memorable (left) to the least memo- learned from this dataset. rable (right). Evaluation: Since we do not have the ground truth of the of- ficial test set, in order to compare these methods, we divide the development set into 3 parts: 6,000 videos for training, 1,000 videos On the contrary, we think predicting long-term memorability for validating, and 1,000 videos for testing. Table 1 shows the results on this dataset requires more in-depth research. For all of our tried of different methods that we tested with our 1,000 test videos as methods, the results are always better when training/validating well as the task’s official test set. with short-term labels. Long-term labels seem to confuse the model With our approach of using sequential visual features of videos which leads to worse performance. One possible reason that can with LSTM, the model with 1024 hidden units achieves the best score cause the inconsistency in this particular dataset is that there exists of ρ = 0.501 on 1,000 test videos mentioned above and ρ = 0.484 multiple similar videos with opposite scores about or of specific on the official test set for the short-term task; while for the long- objects. term task, the model which was trained on short-term labels and validated on long-term labels gets ρ = 0.261 and ρ = 0.257 respec- tively. Meanwhile, if we use only long-term labels, our method gets ρ = 0.214 on the official test set. Table 1: Spearman’s rank correlation results of different methods on dataset from [2]. Figure 3: Similar videos can cause confusion to visual-based ρ Task Model model in long-term memorability. Long-term scores: 0.727 1,000 test videos Official test set (left), 0.273 (right). Our method (2048 units) 0.532 0.480 Our method (1024 units) 0.511 0.484 Short-term AMNet [4] (without attention) 0.480 0.447 AMNet [4] (with attention) 0.487 0.455 As in figure 3, both videos are almost identical in terms of vi- Our method (512 units) 0.525 0.478 sual features such as color, angle, and actor. These videos might Our method (long-term labels) 0.256 0.214 cause participants to make mistake when deciding whether they Our method* 0.261 0.257 Long-term AMNet [4] (attention + long-term labels) 0.252 0.194 had watched it or not. Hence, their long-term labels give opposite Our method (2048 units, short-term labels) 0.272 0.251 results. Our method (1024 units, short-term labels) 0.266 0.252 * 1024 units, model is trained with short-term labels and validated by long- 5 CONCLUSION AND FUTURE WORK term labels. In our approach, we focus on the temporal aspect of videos by using their frames in a LSTM recurrent network. We have not tried using In order to prevent overfitting while training, we apply a dropout a combination of features in the process, hence, we will try using rate of 0.5 on LSTM layer. We found that this rate gives the best multiple aspects of a video to measure its performance. results among 3 dropout rates of 0.25, 0.5, 0.75. The model also Acknowledgments: We would like to thank SE-AI Lab, VNU- starts overfitting as it reaches its peak at around ρ = 0.50 − 0.52 HCMUS for their precious support. Predicting Media Memorability Task MediaEval’18, 29-31 October 2018, Sophia Antipolis, France REFERENCES [1] Antonio Torralba Aditya Khosla, Akhil S. Raju and Aude Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In 2015 International Conference on Computer Vision (ICCV). 2390–2398. [2] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proc. of the MediaEval 2018 Workshop, 29-31 October 2018, Sophia Antipolis, France. [3] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire- Hélène Demarty. 2018. Annotating, Understanding, and Predicting Long-term Video Memorability. In Proceedings of the 2018 International Conference on Multimedia Retrieval, Yokohama, Japan. 178–186. [4] Jiri Fajtl, Vasileios Argyriou, Dorothy Monekosso, and Paolo Re- magnino. 2018. AMNet: Memorability Estimation with Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6363–6372. [5] Junwei Han, Changyuan Chen, Ling Shao, Xintao Hu, Jungong Han, and Tianming Liu. 2015. Learning Computational Models of Video Memorability from fMRI Brain Imaging. IEEE Trans. Cybernetics 45, 8 (2015), 1692–1703. [6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Ef- ficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). [7] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) (2015). [8] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops, Venice, Italy. 2730–2739. [9] Hammad Squalli-Houssaini, Ngoc Q. K. Duong, Marquant Gwenaelle, and Claire-Hélène Demarty. 2018. Deep Learning for Predicting Im- age Memorability. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Calgary, AB, Canada. 2371–2375. [10] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA. 2818–2826. [11] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In 2015 IEEE International Conference on Computer Vision, ICCV, Santiago, Chile. 4489–4497.