                                Show and Recall @ MediaEval 2018
                             ViMemNet: Predicting Video Memorability
                                                Ritwick Chaudhry, Manoj Kilaru, Sumit Shekhar
                                                                         Adobe Research
                                                             {rchaudhr,kilaru,sushekha}@adobe.com
ABSTRACT
In the current age of expanding Internet access, there has been a flood of videos on the web. Studying the human cognitive factors that affect the consumption of these videos is becoming increasingly important for organizing and curating them effectively. One such cognitive factor is Video Memorability: the ability to recall a video's content after watching it. In this paper, we present our approach to the MediaEval 2018 Predicting Media Memorability Task. We develop a 3-forked pipeline for predicting memorability scores that leverages visual image features (both low-level and high-level), the image saliency of different video frames, and the information present in the captions. We also explore the relevance of other features, such as the image memorability scores of different frames in the video, and present a detailed analysis of the results.
1 INTRODUCTION
With the explosion of visual content on the Internet, it is becoming increasingly important to discover new cognitive metrics to analyze this content. Memorability of visual content is one such metric. Previous studies on memorability [3] suggest that even though we come across a plethora of photos and images each day, our long-term memory is capable of storing a massive number of objects, with details, from the images we encounter. Although the memorability of visual content is affected by personal context and subjective consumption [9], it has been shown [2, 11] that there is a high degree of consistency across people in their ability to retain information. This makes memorability an objective prediction target.
   Recent efforts to predict the memorability of images have been successful, aided by the development of a large-scale dataset on image memorability [13]. In [13], near-human consistency in rank correlation for image memorability is achieved, establishing that this human cognitive ability is within reach for the field of computer vision. Despite these efforts in the realm of images, there has been limited work on predicting the memorability of videos, given the added complexities that videos bring.
   Therefore, we analyze the task of predicting memorability scores for videos in the context of MediaEval 2018.
2 RELATED WORK
The concept of memorability has been studied in psychology and neuroscience. These studies mostly focused on visual memory, examining, for instance, the human capacity for remembering object details [3], the effect of stimuli on encoding and later retrieval from memory [1], and the memory systems of the brain [18, 19]. Broadly, prior work on recall of information about viewed visual content can be divided into the following categories:
   Image Memorability: Isola et al. [12] initiated the computational study of the cognitive metric of image memorability. The authors showed that across various subjects and under a wide range of contexts, the memorability of an image is consistent, which indicates that image memorability is an intrinsic property of images. Since then, many works have explored this problem [10, 11, 14, 15, 17]. Khosla et al. [13] introduced the largest annotated image memorability dataset (containing 60,000 images from diverse sources) and showed that fine-tuned deep features outperform all other features by a large margin. Fajtl et al. [7] used a visual attention mechanism and designed an end-to-end trainable deep neural network for estimating memorability. Siarohin et al. [21] adopted a deep architecture for generating a memorable picture from a given input image and a style seed.
   Video Memorability: Han et al. [8] commenced computational studies on the memorability of videos by learning from brain functional magnetic resonance imaging (fMRI). As the method relies on fMRI measurements of the users for learning the model, it is difficult to generalize. The authors in [5, 20] used spatio-temporal features to represent video dynamics and a regression framework for predicting memorability.
   We extend their work by proposing a trainable deep learning framework for predicting Video Memorability scores.
3 APPROACH
In this section, we discuss the task of predicting Video Memorability. The feature extraction from videos is described in Section 3.1, and an analysis of features for memorability prediction is discussed in Section 3.2.

[Figure 1: Our proposed model architecture]

3.1 Feature Extraction
We used the following features, divided into three groups:

Group 1 (G1)
• C3D: C3D features are the outputs of the final classification layer of a deep 3D convolutional network trained on a large-scale supervised video dataset.
• Color Histogram: Computed in the HSV color space using 64 bins per channel for 3 key frames (the first, middle, and last frames) of each video; see the sketch after this list.
• InceptionV3: The final class activations of the InceptionV3 deep network for object recognition, trained on the ImageNet dataset.
• Saliency: The aspects of visual content that grab human attention have been shown to be useful in predicting memorability [6, 11]. We used DeepGaze II [16], the highest-ranking saliency prediction model on the MIT300 dataset of the MIT Saliency Benchmark (by AUC and sAUC), to generate saliency maps for all 3 key frames.
• Captions: We used the textual captions present in the dataset, which were generated manually to describe the videos. Captions can be a compact representation of video content, and thus useful for prediction.
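To make the color feature concrete, the following is a minimal sketch (not the authors' code) of key-frame extraction and the per-channel 64-bin HSV histogram using OpenCV; the helper names and the L1 normalization are our own choices.

    import cv2
    import numpy as np

    # Per-channel value ranges for 8-bit HSV in OpenCV (hue runs 0-179).
    HSV_RANGES = [(0, 180), (0, 256), (0, 256)]

    def key_frames(video_path):
        """Return the first, middle, and last frames of a video (BGR)."""
        cap = cv2.VideoCapture(video_path)
        n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in (0, n // 2, n - 1):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames

    def hsv_histogram(frame, bins=64):
        """64 bins per HSV channel, L1-normalized; yields a 192-d vector."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hists = [cv2.calcHist([hsv], [c], None, [bins], list(HSV_RANGES[c])).ravel()
                 for c in range(3)]
        vec = np.concatenate(hists)
        return vec / (vec.sum() + 1e-8)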
Group 2 (G2)
• Image Memorability: We divided each video into 10 frames and used image memorability scores for each frame, predicted with a pre-trained model [13].

Group 3 (G3)
• HMP: A histogram of motion patterns is computed for each video, and Principal Component Analysis (PCA) with 128 principal components is applied to obtain a reduced-dimensional encoding.
• HoG: HoG descriptors (Histograms of Oriented Gradients) are calculated on 32x32 windows of each key frame, and 256 principal components are extracted from each feature; a sketch follows this list.
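A rough sketch of the HoG encoding with scikit-image and scikit-learn follows; the 32x32 windows and 256 PCA components come from the text, while the HoG cell parameters, the helper names, and the assumption that all key frames share one resolution are ours.

    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog
    from skimage.util import view_as_blocks
    from sklearn.decomposition import PCA

    def hog_descriptor(frame_rgb):
        """Concatenate HoG descriptors over non-overlapping 32x32 windows."""
        gray = rgb2gray(frame_rgb)
        h, w = gray.shape
        gray = gray[:h - h % 32, :w - w % 32]  # crop to a multiple of 32
        blocks = view_as_blocks(gray, (32, 32))
        descs = [hog(blocks[i, j], pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                 for i in range(blocks.shape[0]) for j in range(blocks.shape[1])]
        return np.concatenate(descs)

    def reduce_hog(key_frames_rgb):
        """Fit PCA on HoG descriptors of all key frames; keep 256 components.
        Assumes all frames share one resolution so descriptors align."""
        X = np.stack([hog_descriptor(f) for f in key_frames_rgb])
        return PCA(n_components=256).fit_transform(X)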
3.2 Model Description and Prediction Analysis
Here, we describe our proposed model and provide the training details. The dataset [4] consists of 8,000 training videos and 2,000 test videos, with each video being 7 seconds long. The training data is randomly split 80:20 into training and validation sets, respectively. We use a 3-forked pipeline architecture (see Figure 1) for predicting memorability scores, which leverages the aforementioned visual features, image saliency, and captions.
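A one-line sketch of the random 80:20 split with scikit-learn, where dev_videos is a hypothetical list of the development-set video IDs:

    from sklearn.model_selection import train_test_split

    # 80:20 random split of the development videos (the seed is our choice).
    train_videos, val_videos = train_test_split(dev_videos, test_size=0.2,
                                                random_state=0)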
   Saliency: The saliency maps extracted from the video frames are downscaled to 120x68. A 2-layer CNN is applied to these maps, with each layer consisting of a 2D convolution, batch normalization, ReLU activation, and a max-pooling operation. Finally, the outputs are vectorized and a fully connected linear layer is applied.
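A minimal PyTorch sketch of this saliency branch follows; the channel widths, kernel size, and output dimension are not specified in the paper, so the values here are assumptions.

    import torch
    import torch.nn as nn

    class SaliencyBranch(nn.Module):
        """Two conv blocks over 120x68 saliency maps (widths are assumptions)."""
        def __init__(self, out_dim=128):
            super().__init__()
            def block(cin, cout):
                return nn.Sequential(
                    nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                    nn.BatchNorm2d(cout),
                    nn.ReLU(),
                    nn.MaxPool2d(2))
            self.features = nn.Sequential(block(1, 16), block(16, 32))
            self.fc = nn.Linear(32 * 30 * 17, out_dim)  # 120x68 -> 60x34 -> 30x17

        def forward(self, x):  # x: (batch, 1, 120, 68)
            return self.fc(self.features(x).flatten(1))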
   Captions: Each word in a caption is represented using pre-trained 100-dimensional GloVe embeddings, and the embedding sequence is passed through a single-layer LSTM with hidden dimension 100. The final representation of the caption is appended to the rest of the features, as shown in Figure 1.
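Continuing the PyTorch sketch, the caption branch might look as follows; tokenization and loading of the GloVe weight matrix are left out, and CaptionBranch is our own name.

    class CaptionBranch(nn.Module):
        """Pre-trained 100-d GloVe embeddings -> single-layer LSTM (hidden 100)."""
        def __init__(self, glove_weights):  # glove_weights: (vocab_size, 100)
            super().__init__()
            self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
            self.lstm = nn.LSTM(input_size=100, hidden_size=100, batch_first=True)

        def forward(self, token_ids):  # token_ids: (batch, seq_len)
            _, (h_n, _) = self.lstm(self.embed(token_ids))
            return h_n[-1]  # (batch, 100) caption representation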
   Other Visual Features: The C3D, Color Histogram, HMP, HoG, InceptionV3, and Image Memorability features are concatenated and then combined with the saliency and caption representations. A five-layer fully connected network is then applied to obtain a single number representing the memorability score. The model is trained with Stochastic Gradient Descent using a Mean Squared Error loss.
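A sketch of the fusion head and training objective, again in PyTorch; the paper fixes the depth (five linear layers), the SGD optimizer, and the MSE loss, while the hidden widths and learning rate below are assumptions.

    class FusionHead(nn.Module):
        """Five linear layers mapping the concatenated forks to one score.
        Hidden widths are our assumptions; the paper specifies only the depth."""
        def __init__(self, in_dim):
            super().__init__()
            dims = [in_dim, 512, 256, 128, 64, 1]
            layers = []
            for i in range(5):
                layers.append(nn.Linear(dims[i], dims[i + 1]))
                if i < 4:
                    layers.append(nn.ReLU())
            self.net = nn.Sequential(*layers)

        def forward(self, visual, saliency, caption):
            return self.net(torch.cat([visual, saliency, caption], dim=1))

    # Training, as stated in the text: SGD with an MSE objective.
    # model = FusionHead(in_dim=...)  # fill in the concatenated feature size
    # optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    # loss = nn.MSELoss()(model(v, s, c).squeeze(1), target_scores)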
   We trained separate models for the short-term and long-term memorability scores using the aforementioned architecture. Results are presented in Table 1 and Table 2.

Table 1: Spearman rank correlation and MSE on the train, validation, and test sets for short-term memorability scores

   Features   Train              Validation         Test
              Spearman   MSE     Spearman   MSE     Spearman   MSE
   G1,G3      0.5897     0.0039  0.3959     -       0.3554     0.0060
   G1,G2      0.5815     0.0042  0.341      -       0.3149     0.0062
   G1         0.7401     0.0027  0.3222     -       0.3096     0.0068

Table 2: Spearman rank correlation and MSE on the train, validation, and test sets for long-term memorability scores

   Features   Train              Validation         Test
              Spearman   MSE     Spearman   MSE     Spearman   MSE
   G1,G3      0.8941     0.0065  0.2034     -       0.0878     0.0313
   G1,G2      0.2882     0.0197  0.153      -       0.1399     0.0198
   G1         0.2975     0.0296  0.1437     -       0.1499     0.0286

   We believe that all of the higher-level G1 features are required for memorability prediction. To test whether the low-level features (Group G3) help, we ran experiments both including and excluding the G3 features (Table 1). We also experimented with using the image memorability scores of frames sampled from the video.

4 DISCUSSION AND OUTLOOK
In this work, we have described a robust way to model and compute Video Memorability. It is empirically clear that the G3 features (low-level image features of key frames) help. Including the image memorability scores of key frames, however, did not improve performance, hinting that videos are much more than a set of frames and that temporal features matter. In the future, we plan to extend the Video Memorability experiments with improved features, such as dense optical flow and action-based features representing the sequence of actions in a video, and also aim to leverage the audio in the videos.
REFERENCES
[1] Wilma A Bainbridge, Daniel D Dilks, and Aude Oliva. 2017. Memorability: A stimulus-driven perceptual neural signature distinctive from memory. NeuroImage 149 (2017), 141–152.
[2] Wilma A Bainbridge, Phillip Isola, and Aude Oliva. 2013. The intrinsic memorability of face photographs. Journal of Experimental Psychology: General 142, 4 (2013), 1323.
[3] Timothy F Brady, Talia Konkle, George A Alvarez, and Aude Oliva. 2008. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences 105, 38 (2008), 14325–14329.
[4] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In The Proceedings of MediaEval 2018 Workshop, 29-31 October 2018, Sophia Antipolis, France.
[5] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire-Hélène Demarty. 2018. Annotating, Understanding, and Predicting Long-term Video Memorability. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. ACM, 178–186.
[6] Rachit Dubey, Joshua Peterson, Aditya Khosla, Ming-Hsuan Yang, and Bernard Ghanem. 2015. What makes an object memorable? In Proceedings of the IEEE International Conference on Computer Vision. 1089–1097.
[7] Jiri Fajtl, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. 2018. AMNet: Memorability Estimation with Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6363–6372.
[8] Junwei Han, Changyuan Chen, Ling Shao, Xintao Hu, Jungong Han, and Tianming Liu. 2015. Learning computational models of video memorability from fMRI brain imaging. IEEE Transactions on Cybernetics 45, 8 (2015), 1692–1703.
[9] R Reed Hunt and James B Worthen. 2006. Distinctiveness and Memory. Oxford University Press.
[10] Phillip Isola, Devi Parikh, Antonio Torralba, and Aude Oliva. 2011. Understanding the intrinsic memorability of images. In Advances in Neural Information Processing Systems. 2429–2437.
[11] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2013. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).
[12] Phillip Isola, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2011. What makes an image memorable? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In International Conference on Computer Vision (ICCV).
[14] Aditya Khosla, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2012. Memorability of image regions. In Advances in Neural Information Processing Systems. 296–304.
[15] Jongpil Kim, Sejong Yoon, and Vladimir Pavlovic. 2013. Relative spatial features for image memorability. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 761–764.
[16] Matthias Kümmerer, Thomas S. A. Wallis, Leon A. Gatys, and Matthias Bethge. 2017. Understanding Low- and High-Level Contributions to Fixation Prediction. In The IEEE International Conference on Computer Vision (ICCV).
[17] Matei Mancas and Olivier Le Meur. 2013. Memorability of natural scenes: The role of attention. In 2013 20th IEEE International Conference on Image Processing (ICIP). IEEE, 196–200.
[18] James L McGaugh, Larry Cahill, and Benno Roozendaal. 1996. Involvement of the amygdala in memory storage: interaction with other brain systems. Proceedings of the National Academy of Sciences 93, 24 (1996), 13508–13514.
[19] James L McGaugh, Ines B Introini-Collison, Larry F Cahill, Claudio Castellano, Carla Dalmaz, Marise B Parent, and Cedric L Williams. 1993. Neuromodulatory systems and memory storage: role of the amygdala. Behavioural Brain Research 58, 1-2 (1993), 81–90.
[20] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2730–2739.
[21] Aliaksandr Siarohin, Gloria Zen, Cveta Majtanovic, Xavier Alameda-Pineda, Elisa Ricci, and Nicu Sebe. 2017. How to Make an Image More Memorable? A Deep Style Transfer Approach. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. ACM, 322–329.