=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_15
|storemode=property
|title=Show and Recall @ MediaEval 2018 ViMemNet: Predicting Video Memorability
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_15.pdf
|volume=Vol-2283
|authors=Ritwick Chaudhry,Manoj Kilaru,Sumit Shekhar
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ChaudhryKS18
}}
==Show and Recall @ MediaEval 2018 ViMemNet: Predicting Video Memorability==
Ritwick Chaudhry, Manoj Kilaru, Sumit Shekhar
Adobe Research
{rchaudhr,kilaru,sushekha}@adobe.com

Copyright held by the owner/author(s). MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
In the current age of expanding access to the Internet, there has been a flood of videos on the web. Studying the human cognitive factors that affect the consumption of these videos is becoming increasingly important in order to organize and curate them effectively. One such important cognitive factor is Video Memorability, the ability to recall a video’s content after watching it. In this paper, we present our approach to the MediaEval 2018 Predicting Media Memorability Task. We develop a 3-forked pipeline for predicting memorability scores that leverages visual image features (both low-level and high-level), the image saliency in different video frames, and the information present in the captions. We also explore the relevance of other features, such as the image memorability scores of the different frames in the video, and present a detailed analysis of the results.

1 INTRODUCTION
With the explosion of visual content on the Internet, it is becoming increasingly important to discover new cognitive metrics to analyze this content. Memorability of visual content is one such metric. Previous studies on memorability [3] suggest that even though we come across a plethora of photos and images each day, our long-term memory is capable of storing a massive number of objects, with details, from the images we have come across. Although the memorability of visual content is affected by personal context and subjective consumption [9], it has been shown [2, 11] that there is a high degree of consistency amongst people in the ability to retain information. This makes memorability an objective target.
Recent efforts to predict the memorability of images have been successful, aided by the development of a large-scale dataset on image memorability [13]. In [13], near-human consistency in rank correlation for image memorability is achieved, thereby establishing that such human cognitive abilities are within reach for the field of computer vision. Despite these efforts in the realm of images, there has been limited work on predicting the memorability of videos, given the added complexities that videos bring. Therefore, we analyze the task of predicting memorability scores for videos in the context of MediaEval 2018.

2 RELATED WORK
The concept of memorability has been studied in psychology and neuroscience. These studies mostly focused on visual memory, examining for instance the human capacity for remembering object details [3], the effect of stimuli on encoding and later retrieval from memory [1], and the memory systems of the brain [18, 19]. Broadly, prior work on recall of information about viewed visual content can be divided into the following categories:
Image Memorability: Isola et al. [12] started the computational study of the cognitive metric of image memorability. The authors showed that across various subjects and under a wide range of contexts, the memorability of an image is consistent, which indicates that image memorability is an intrinsic property of images. Many works have since explored this problem [10, 11, 14, 15, 17]. Khosla et al. [13] introduced the largest annotated image memorability dataset (containing 60,000 images from diverse sources) and showed that fine-tuned deep features outperform all other features by a large margin. Fajtl et al. [7] used a visual attention mechanism and designed an end-to-end trainable deep neural network for estimating memorability. Siarohin et al. [21] adopted a deep architecture for generating a memorable picture from a given input image and a style seed.
Video Memorability: Han et al. [8] commenced computational studies on the memorability of videos by learning from functional magnetic resonance imaging (fMRI) of the brain. As the method relied on fMRI measurements of the users to learn the model, it would be difficult to generalize. The authors in [5, 20] used spatio-temporal features to represent video dynamics and applied a regression framework for predicting memorability. We extend their work by proposing a trainable deep learning framework for predicting video memorability scores.

3 APPROACH
In this section, we discuss the task of predicting Video Memorability. The feature extraction from videos is described in Section 3.1, and an analysis of features for memorability prediction is discussed in Section 3.2.

[Figure 1: Our proposed model architecture]

3.1 Feature Extraction
We used the following features, divided into 3 groups:
Group 1 (G1)
• C3D: C3D features are the outputs of the final classification layer of deep 3D convolutional networks trained on a large-scale supervised video dataset;
• Color Histogram: This is computed in the HSV space using 64 bins per channel for 3 key-frames (the first, middle and last frames) of each video;
• InceptionV3: This corresponds to the final class activations of the InceptionV3 deep network for object detection, trained on the ImageNet dataset;
• Saliency: The aspect of visual content that grabs human attention has been shown to be useful in predicting memorability [6, 11]. We used DeepGaze II [16], the highest-ranking saliency prediction model on the MIT300 dataset of the MIT Saliency Benchmark (AUC, sAUC), and generated saliency maps for all 3 key frames;
• Captions: We used the textual captions present in the dataset, which were generated manually to describe the videos. Captions can be a compact representation of the video content, and thus can be useful for prediction.
Group 2 (G2)
• Image Memorability: We divided each video into 10 frames and used image memorability scores for each frame, predicted using a pre-trained model [13].
Group 3 (G3)
• HMP: A histogram of motion patterns is computed for each video, and Principal Component Analysis (PCA) with 128 principal components is applied to obtain a reduced-dimensional encoding;
• HoG: HoG (Histogram of Oriented Gradients) descriptors are calculated on 32x32 windows of each key frame, and 256 principal components are extracted from each feature.
Illustrative sketches of how the Color Histogram, InceptionV3 and HoG features can be computed are given below.
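To make the Color Histogram feature concrete, the following is a minimal sketch of the computation, assuming OpenCV and NumPy; the key-frame selection helper, the normalization and the function names are illustrative rather than the exact pipeline code.

```python
# Sketch: HSV color histogram over 3 key frames (first, middle, last).
# Assumes OpenCV (cv2) and NumPy; names and normalization are illustrative.
import cv2
import numpy as np

def sample_key_frames(video_path):
    """Return the first, middle and last frames of a video as BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in (0, n // 2, n - 1):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def extract_color_histogram(frame, bins=64):
    """64-bin histogram per HSV channel, L1-normalized and concatenated."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hists = []
    for ch, max_val in enumerate((180, 256, 256)):  # OpenCV HSV channel ranges
        h = cv2.calcHist([hsv], [ch], None, [bins], [0, max_val]).ravel()
        hists.append(h / max(h.sum(), 1e-8))
    return np.concatenate(hists)  # 192-dim per frame; concatenate over 3 key frames
```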
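The InceptionV3 activations can be obtained as sketched below, assuming a torchvision model with standard ImageNet preprocessing; the helper name and the use of the 1000-way class logits as the feature vector are assumptions.

```python
# Sketch: final class activations of an ImageNet-trained InceptionV3 for a
# key frame, via torchvision. Preprocessing uses the standard ImageNet
# statistics; the helper name is illustrative.
import torch
from torchvision import models, transforms

_inception = models.inception_v3(pretrained=True).eval()
_prep = transforms.Compose([
    transforms.Resize((299, 299)),  # InceptionV3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def inception_activations(pil_frame):
    """Return the 1000-dim class activations for one key frame (PIL image)."""
    with torch.no_grad():
        return _inception(_prep(pil_frame).unsqueeze(0)).squeeze(0)
```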
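Likewise, a sketch of the HoG feature in G3, assuming scikit-image descriptors and scikit-learn PCA; the 32x32 window size and the 256 principal components follow the text, while the HoG cell and block parameters are assumptions.

```python
# Sketch: HoG descriptors on non-overlapping 32x32 windows of a grayscale
# key frame, later reduced with PCA. Cell/block parameters are assumptions.
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA

def hog_windows(gray_frame, win=32):
    """Compute one HoG descriptor per non-overlapping 32x32 window."""
    h, w = gray_frame.shape
    descriptors = []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            patch = gray_frame[y:y + win, x:x + win]
            descriptors.append(hog(patch, pixels_per_cell=(8, 8),
                                   cells_per_block=(2, 2)))
    return np.stack(descriptors)

# PCA is fit over descriptors gathered from the training set (at least 256
# descriptors are needed for 256 components), then applied per frame:
# pca = PCA(n_components=256).fit(all_training_descriptors)
# reduced = pca.transform(hog_windows(gray_frame))
```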
3.2 Model Description and Prediction Analysis
Here, we describe our proposed model and provide the training details. The dataset [4] consists of 8000 training videos and 2000 test videos, with each video being 7 seconds long. The training data is randomly split 80:20 between training the model and validation. Our 3-forked pipeline architecture (see Figure 1) for predicting memorability scores leverages the aforementioned visual features, image saliency and captions.
Saliency: The saliency maps extracted from the video frames are downscaled to 120 by 68. A 2-layer CNN is applied to these maps, with each layer consisting of a 2D convolution, batch normalization, ReLU activation and a max-pooling operation. Finally, the resulting maps are vectorized and a fully connected linear layer is applied.
Captions: Each word in a caption is represented using pre-trained 100-dimensional GloVe embeddings, and the embedded sequence is passed through a single-layer LSTM with hidden dimension 100. The final representation of the caption is appended to the rest of the features, as shown in Figure 1.
Other Visual Features: The C3D, Color Histogram, HMP, HoG, InceptionV3 and Image Memorability features are concatenated and then combined with the saliency and caption representations. A five-layer fully connected neural network is then applied to obtain a single number representing the memorability score. The model is trained using Stochastic Gradient Descent with a Mean Squared Error loss function.
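The saliency fork admits a compact PyTorch sketch. The 120x68 input size, the two conv/batch-norm/ReLU/max-pool layers and the final linear layer follow the text; the channel counts, kernel sizes and the 128-dimensional output width are assumptions.

```python
# Sketch of the saliency fork: two (conv -> BN -> ReLU -> max-pool) layers
# over the 120x68 saliency maps of the 3 key frames, flattened into a
# linear layer. Channel counts and output width are assumptions.
import torch
import torch.nn as nn

class SaliencyBranch(nn.Module):
    def __init__(self, in_maps=3, out_dim=128):  # 3 key-frame saliency maps
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_maps, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 68x120 input -> 17x30 after two 2x2 max-poolings
        self.fc = nn.Linear(32 * 17 * 30, out_dim)

    def forward(self, maps):  # maps: (batch, 3, 68, 120)
        x = self.cnn(maps)
        return self.fc(x.flatten(1))
```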
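The caption fork can be sketched similarly; the 100-dimensional GloVe embeddings and the single-layer LSTM of hidden size 100 follow the text, while the vocabulary handling and the use of the final hidden state are assumptions.

```python
# Sketch of the caption fork: pre-trained 100-d GloVe embeddings fed to a
# single-layer LSTM of hidden size 100; the final hidden state serves as
# the caption representation. Tokenization/vocabulary are assumptions.
import torch
import torch.nn as nn

class CaptionBranch(nn.Module):
    def __init__(self, glove_weights):  # (vocab_size, 100) GloVe matrix
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.lstm = nn.LSTM(input_size=100, hidden_size=100, batch_first=True)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]  # (batch, 100)
```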
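Finally, a sketch of the fusion head and one training step: the concatenation of the three forks, the five fully connected layers, the single-score output, SGD and the MSE loss follow the text, while the hidden widths and the learning rate are assumptions.

```python
# Sketch of the fusion head: concatenated fork outputs pass through five
# fully connected layers to one memorability score; trained with SGD + MSE.
# Hidden widths and learning rate are assumptions.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, in_dim):  # in_dim = sum of the three forks' widths
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, visual, saliency, caption):
        return self.mlp(torch.cat([visual, saliency, caption], dim=1)).squeeze(1)

# Hypothetical training step, with `model` wrapping the forks and FusionHead:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# loss = nn.MSELoss()(model(feats, sal_maps, tokens), target_scores)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```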
We trained separate models for Long Term and Short Term memorability scores using the aforementioned architecture. The results are presented in Table 1 and Table 2.

Table 1: Rank correlation and MSE on the train, validation and test sets for Short Term memorability scores

Features      Train               Validation          Test
              Spearman   MSE      Spearman   MSE      Spearman   MSE
G1,G3         0.5897     0.0039   0.3959     -        0.3554     0.0060
G1,G2         0.5815     0.0042   0.341      -        0.3149     0.0062
G1            0.7401     0.0027   0.3222     -        0.3096     0.0068

Table 2: Rank correlation and MSE on the train, validation and test sets for Long Term memorability scores

Features      Train               Validation          Test
              Spearman   MSE      Spearman   MSE      Spearman   MSE
G1,G3         0.8941     0.0065   0.2034     -        0.0878     0.0313
G1,G2         0.2882     0.0197   0.153      -        0.1399     0.0198
G1            0.2975     0.0296   0.1437     -        0.1499     0.0286

We believe that all of the higher-level G1 features are required for memorability prediction. To test whether the low-level features (Group G3) help in prediction, we ran experiments including and excluding the G3 features (Table 1). We also experimented with using the Image Memorability scores of frames sampled from the video.
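The two metrics reported in Tables 1 and 2 can be computed as in the following sketch (the variable names are illustrative):

```python
# Sketch: Spearman rank correlation and MSE between predicted and
# ground-truth memorability scores, using SciPy and NumPy.
import numpy as np
from scipy.stats import spearmanr

def evaluate(predicted, ground_truth):
    """Return (Spearman rank correlation, mean squared error)."""
    rho, _ = spearmanr(predicted, ground_truth)
    mse = float(np.mean((np.asarray(predicted) - np.asarray(ground_truth)) ** 2))
    return rho, mse
```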
4 DISCUSSION AND OUTLOOK
In this work, we have described a robust way to model and compute Video Memorability. It is empirically clear that using the G3 features (low-level image features of keyframes) helps. Also, including the Image Memorability scores of key frames did not lead to any improvement in performance, hinting that videos are much more than just a set of frames, and that temporal features matter. In the future, we plan to conduct the Video Memorability experiment with improved features such as Dense Optical Flow features and action-based features representing the sequence of actions in the video, and we also aim to leverage the audio in the videos.

REFERENCES
[1] Wilma A. Bainbridge, Daniel D. Dilks, and Aude Oliva. 2017. Memorability: A stimulus-driven perceptual neural signature distinctive from memory. NeuroImage 149 (2017), 141–152.
[2] Wilma A. Bainbridge, Phillip Isola, and Aude Oliva. 2013. The intrinsic memorability of face photographs. Journal of Experimental Psychology: General 142, 4 (2013), 1323.
[3] Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva. 2008. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences 105, 38 (2008), 14325–14329.
[4] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proceedings of the MediaEval 2018 Workshop, 29-31 October 2018, Sophia Antipolis, France.
[5] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire-Hélène Demarty. 2018. Annotating, Understanding, and Predicting Long-term Video Memorability. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. ACM, 178–186.
[6] Rachit Dubey, Joshua Peterson, Aditya Khosla, Ming-Hsuan Yang, and Bernard Ghanem. 2015. What makes an object memorable?. In Proceedings of the IEEE International Conference on Computer Vision. 1089–1097.
[7] Jiri Fajtl, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. 2018. AMNet: Memorability Estimation with Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6363–6372.
[8] Junwei Han, Changyuan Chen, Ling Shao, Xintao Hu, Jungong Han, and Tianming Liu. 2015. Learning computational models of video memorability from fMRI brain imaging. IEEE Transactions on Cybernetics 45, 8 (2015), 1692–1703.
[9] R. Reed Hunt and James B. Worthen. 2006. Distinctiveness and Memory. Oxford University Press.
[10] Phillip Isola, Devi Parikh, Antonio Torralba, and Aude Oliva. 2011. Understanding the intrinsic memorability of images. In Advances in Neural Information Processing Systems. 2429–2437.
[11] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2013. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).
[12] Phillip Isola, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2011. What makes an image memorable? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In International Conference on Computer Vision (ICCV).
[14] Aditya Khosla, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2012. Memorability of image regions. In Advances in Neural Information Processing Systems. 296–304.
[15] Jongpil Kim, Sejong Yoon, and Vladimir Pavlovic. 2013. Relative spatial features for image memorability. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 761–764.
[16] Matthias Kummerer, Thomas S. A. Wallis, Leon A. Gatys, and Matthias Bethge. 2017. Understanding Low- and High-Level Contributions to Fixation Prediction. In IEEE International Conference on Computer Vision (ICCV).
[17] Matei Mancas and Olivier Le Meur. 2013. Memorability of natural scenes: The role of attention. In 2013 20th IEEE International Conference on Image Processing (ICIP). IEEE, 196–200.
[18] James L. McGaugh, Larry Cahill, and Benno Roozendaal. 1996. Involvement of the amygdala in memory storage: interaction with other brain systems. Proceedings of the National Academy of Sciences 93, 24 (1996), 13508–13514.
[19] James L. McGaugh, Ines B. Introini-Collison, Larry F. Cahill, Claudio Castellano, Carla Dalmaz, Marise B. Parent, and Cedric L. Williams. 1993. Neuromodulatory systems and memory storage: role of the amygdala. Behavioural Brain Research 58, 1-2 (1993), 81–90.
[20] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2730–2739.
[21] Aliaksandr Siarohin, Gloria Zen, Cveta Majtanovic, Xavier Alameda-Pineda, Elisa Ricci, and Nicu Sebe. 2017. How to Make an Image More Memorable? A Deep Style Transfer Approach. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. ACM, 322–329.