Multimodal Approach to Predicting Media Memorability

Tanmayee Joshi, Sarath Sivaprasad, Savita Bhat, Niranjan Pedanekar
TCS Research, Pune, India
tanmayee.joshi@tcs.com, sarath.s7@tcs.com

Copyright held by the owner/author(s). MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
In this paper, we present a multimodal approach to modelling media memorability for the "Predicting Media Memorability" task at MediaEval 2018. Our approach uses video- and image-based features along with the provided textual descriptions to predict a probability-like memorability score for each of the seven-second audioless video clips. We use the same set of features for predicting both short-term and long-term media memorability.

1 INTRODUCTION
With the dramatic surge of visual media content on platforms like Instagram, Flickr and YouTube, it is imperative that new methods for curating, annotating and organizing this content be explored. To this effect, non-traditional metrics for tagging media content have been examined. Previous works have used metrics such as aesthetics [2], interestingness [5] and memorability [9] to annotate and rank images. The "MediaEval 2018: Predicting Media Memorability" task [1] focuses on predicting 'short-term' and 'long-term' memorability for videos.

An important aspect of human cognition is the ability to remember and recall photos and videos with a surprising amount of detail. Interestingly, not all content is stored and recalled equally well [8]. Previous attempts [5, 8] at predicting image and video memorability discuss factors affecting memorability. In the experiment described in [1], people were shown content not related to them personally, and a varying probability of detecting a repetition of a given video after a short or long delay was recorded. Deep learning models have given promising predictions of image memorability [9, 12]. We propose an ensemble of deep learning models that takes into account various properties that are correlated with memorability. We capture these aspects by deriving the respective features from text embeddings, frames and video.

2 APPROACH
In this section, we outline our multimodal approach to modelling media memorability using video, image and text features. The visual features are inspired by different properties of images such as saliency and aesthetics. We assumed that the memorability of a video is affected by the properties of the images comprising the video. We also hypothesized that captions provide additional cues for understanding the semantics of videos. Short-term memorability and long-term memorability were modelled independently using the same set of features explained in this section.

Figure 1: Model Architecture

2.1 Visual Features
We used image-level features based on color, saliency, aesthetics, memorability and the presence of human faces. Image features were calculated on frames 0, 56 and 112. Except for aesthetics, we computed the mean and standard deviation of these features across the three frames to obtain a feature representation for a video clip. C3D features were used to represent the spatiotemporal aspect of a video.

Color: Color and its distribution have a significant influence on human cognition [6, 15]. Color information was captured using a 3D HSV histogram feature [4] and colorfulness [7]. The statistics over these vectors provided a 128-dimensional vector per video.

Saliency: Previous works [3, 11] observed that saliency is relevant for predicting memorability. For every image, a saliency map was created using a pre-trained image saliency network (SalNet) [10]. We hypothesized that the intensity distribution of saliency inside the image, and its change across frames, contributes more towards memorability than the spatial spread and orientation of salient pixels. We created bins from the saliency maps based on pixel intensity, with histogram boundaries of variable lengths to accommodate the variance of the pixel distribution.

Aesthetics: Aesthetics and human judgements of memorability are highly correlated [5, 8]. We used the median value, across frames, of the aesthetic visual features provided with the dataset.

Face-based Feature: Using a state-of-the-art deep learning method [13], we computed the number of faces per keyframe. Using this information, the dataset was divided into two parts: clips with faces and clips without faces. Running a Mann-Whitney U test over the memorability scores of the two populations, we found that they are significantly different (p-value 1.06e-31); a sketch of this test is given at the end of this subsection.

Image Memorability: We hypothesized that the memorability of a video is affected by the memorability of the images comprising it. We used MemNet, proposed by [9], to obtain a memorability score per image. This score was used directly as a part of the ensemble.

C3D: We used the fc7 activations of the C3D network (provided with the dataset) as a feature vector to capture activity in the video; it captures spatiotemporal information [14].
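As an illustration of the face-based analysis above, the following is a minimal sketch of the split-and-test procedure. The input arrays and file names are hypothetical placeholders (per-clip face counts and short-term memorability scores are assumed to be precomputed); only the Mann-Whitney U test itself corresponds to the test reported above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical inputs, one entry per video clip:
#   face_counts[i]  - number of faces detected in the keyframes of clip i
#   memorability[i] - annotated short-term memorability score of clip i
face_counts = np.load("face_counts.npy")         # assumed precomputed
memorability = np.load("short_term_scores.npy")  # assumed precomputed

# Split the clips into two populations: with faces and without faces.
with_faces = memorability[face_counts > 0]
without_faces = memorability[face_counts == 0]

# Two-sided Mann-Whitney U test over the memorability of the two populations.
stat, p_value = mannwhitneyu(with_faces, without_faces, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.2e}")
```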
2.2 Text-based Features
To analyse the data with respect to captions, we divided the training data into 4 bins, each consisting of the captions corresponding to one of the 4 quartiles of the memorability annotations. We defined a metric, 'word relevance', inspired by the term frequency-inverse document frequency (Tf-Idf) statistic typically used in Information Retrieval. We define the word relevance of a word w_i in a bin j as WR_ij. Let the total number of bins be N, the total number of words in bin j be W_j, the frequency of word w_i in bin j be w_ij, the number of bins in which word w_i appears be b_{w_i}, and the frequency of word w_i in the other bins be w_{i\hat{j}}. Then

    WR_{ij} = \frac{w_{ij}}{W_j} \left( 1 + \log \frac{N}{b_{w_i}} \right) \frac{1}{w_{i\hat{j}}}    (1)

We created a wordlist of all unique words from the video captions. After stemming and lemmatizing, we removed all stopwords from the list. Words with a WR value above a threshold were shortlisted as candidate words, and their frequency in the captions was used as a feature; a sketch of this computation is given at the end of this subsection. We believe that a higher value of WR quantifies the word's association with a particular range of memorability. We hypothesized that WR increases with the relatively higher frequency of a word in a particular bin with respect to its frequency in the other bins. It was observed that words related to topics such as food and toddlers fall in the higher memorability range, while generic words related to topics such as landscape and scenery fall in the lower memorability range.

We also used pre-trained GloVe embeddings (https://nlp.stanford.edu/projects/glove/) of words to capture more information from the textual descriptions. We preprocessed the caption data by removing stopwords and created a 100-dimensional word-embedding vector for each word.
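The following is a minimal sketch of how the word-relevance score in Equation (1) could be computed, assuming the captions have already been stemmed, lemmatized, stopword-filtered and grouped into the four quartile bins. The function name, the bins argument and the division-by-zero guard for words unique to a single bin are assumptions for illustration, not part of the method description above.

```python
import math
from collections import Counter

def word_relevance(bins):
    """Compute WR_ij for every (word, bin) pair as in Equation (1).

    bins: list of token lists, one list per memorability-quartile bin.
    Returns a dict mapping (word, bin_index) -> WR score.
    """
    N = len(bins)                                   # total number of bins
    counts = [Counter(tokens) for tokens in bins]   # w_ij per bin
    totals = [sum(c.values()) for c in counts]      # W_j per bin
    bin_freq = Counter()                            # b_wi: number of bins containing w_i
    for c in counts:
        bin_freq.update(c.keys())

    wr = {}
    for j, c in enumerate(counts):
        for word, w_ij in c.items():
            # Frequency of the word in all other bins (w_ij-hat).
            w_other = sum(counts[k][word] for k in range(N) if k != j)
            # max(..., 1) guards words that appear in only one bin; this
            # edge case is not specified in the text above.
            wr[(word, j)] = (w_ij / totals[j]) \
                * (1 + math.log(N / bin_freq[word])) / max(w_other, 1)
    return wr
```

Words whose WR exceeds a chosen threshold would then be retained as candidate words and their caption frequencies used as features.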
3 EXPERIMENTS
We ranked videos by assigning a probability-like score to each video clip, treating the task as a regression problem. The annotations for short-term and long-term memorability were skewed towards higher values, with means of 0.86 (short-term) and 0.78 (long-term). All input features were normalized, and the ground truth was kept unwhitened so that the model captured the skewed output distribution. We divided the given dataset into train and validation sets in the ratio 3:1 such that the annotations in the two sets have similar distributions. We explored different combinations of features for predicting memorability.

Experiment 1: Low-level features, namely colorfulness, blur value [11] and the HSV histogram, were concatenated and an SVR was trained over this vector.

Experiment 2: We concatenated the face-based feature with the 3D HSV feature. The resultant 130-dimensional vector represents the color spread and facial information in an image. We passed this vector, along with the C3D, aesthetics and saliency features, through dense fully connected layers independently. We ensembled these models using their normalized correlation values on the validation set as coefficients for a weighted average.

Experiment 3: We used word embeddings to train different neural network architectures; CNN-LSTM and Bi-LSTM models gave the best correlation on the training and validation data. We ensembled models trained on image features, video features and text embeddings using a weighted average and SVR.

We used sigmoid activation in the last layer of all networks so as to restrict the output to the range 0 to 1. ReLU activation was used for all other layers. We also fine-tuned the models over the validation data before predicting on the test data. The final predictions submitted for evaluation were based on the models from Experiments 2 and 3.
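The weighted-average ensembling used in Experiments 2 and 3 can be sketched as follows: each model's predictions are combined with weights proportional to its correlation on the validation set. The function and variable names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def ensemble_predictions(model_preds, val_correlations):
    """Weighted-average ensemble of per-model predictions.

    model_preds: list of 1-D arrays of per-clip predictions, one per model.
    val_correlations: validation-set correlation of each model.
    Weights are the correlations normalized to sum to 1.
    """
    weights = np.array(val_correlations, dtype=float)
    weights = weights / weights.sum()
    preds = np.stack(model_preds, axis=0)   # shape: (n_models, n_clips)
    return np.average(preds, axis=0, weights=weights)

# Illustrative usage with three hypothetical component models:
# final = ensemble_predictions([p_color_face, p_c3d, p_text], [0.31, 0.28, 0.35])
```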
4 RESULTS AND ANALYSIS
In Experiment 1, the model performed poorly, with a near-zero Spearman's rank correlation over the validation set. This shows that low-level image features alone are not sufficient to model media memorability. As part of the challenge, we submitted results for five runs based on Experiments 2 and 3. Table 1 lists the correlations obtained by the two best performing models. As per the evaluation on an unseen test set, the best performing model gives Spearman's and Pearson's correlations for short-term memorability of 0.46 and 0.50 respectively; the correlations for long-term memorability are 0.23 and 0.25 respectively. In the run based on Experiment 2, we obtained Spearman's rank correlations of 0.39 and 0.17 for short-term and long-term memorability respectively. Our best submission, based on Experiment 3, gives an improvement of 7%.

Table 1: Correlation for the ensemble models

Ensemble Model      Short-Term Memorability      Long-Term Memorability
                    Pearson      Spearman        Pearson      Spearman
Weighted Average    0.50         0.46            0.25         0.22
SVR                 0.48         0.44            0.25         0.23

As mentioned earlier, the model from Experiment 3 used textual information along with visual features. The improvement shows that the words from the captions contribute to predicting memorability scores. We believe that additional textual features such as location cues and emotion cues may be useful for further improvements. The mean values of our predictions of short-term and long-term memorability over the validation data are 0.83 and 0.78 respectively. These values are close to the mean values of the respective annotations in the training data, which shows that our model succeeds in capturing the skewed distribution of the training data.

5 CONCLUSION AND FUTURE WORK
This paper presents the ensemble model by team AREA66 for predicting media memorability. We use visual features based on images and video along with textual features from the given captions. The results show that better performance is obtained by combining visual and textual features, whereas visual features alone give the lowest prediction correlations; in particular, experiments with only low-level image features give poor results. In future work, we plan to further explore the effect of textual information on video memorability. We also aim to explore more sophisticated methods for utilizing low-level image features to improve prediction performance.

REFERENCES
[1] Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proceedings of the MediaEval 2018 Workshop. 29-31 October 2018, Sophia Antipolis, France.
[2] Sagnik Dhar, Vicente Ordonez, and Tamara L. Berg. 2011. High level describable attributes for predicting aesthetics and interestingness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1657–1664.
[3] Rachit Dubey, Joshua Peterson, Aditya Khosla, Ming-Hsuan Yang, and Bernard Ghanem. 2015. What makes an object memorable? In Proceedings of the IEEE International Conference on Computer Vision. 1089–1097.
[4] Ankit Goyal, Naveen Kumar, Tanaya Guha, and Shrikanth S. Narayanan. 2016. A multimodal mixture-of-experts model for dynamic emotion prediction in movies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2822–2826.
[5] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision. 1633–1640.
[6] Alan Hanjalic. 2006. Extracting moods from pictures and sounds: Towards truly personalized TV. IEEE Signal Processing Magazine 23, 90–100.
[7] David Hasler and Sabine E. Suesstrunk. 2003. Measuring colorfulness in natural images. In Human Vision and Electronic Imaging VIII, Vol. 5007. International Society for Optics and Photonics, 87–96.
[8] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2014. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1469–1482.
[9] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and predicting image memorability at a large scale. In Proceedings of the IEEE International Conference on Computer Vision. 2390–2398.
[10] Junting Pan, Elisa Sayrol, Xavier Giro-i-Nieto, Kevin McGuinness, and Noel E. O'Connor. 2016. Shallow and deep convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 598–606.
[11] José Luis Pech-Pacheco, Gabriel Cristóbal, Jesús Chamorro-Martinez, and Joaquín Fernández-Valdivia. 2000. Diatom autofocusing in brightfield microscopy: a comparative study. In Proceedings of the 15th International Conference on Pattern Recognition, Vol. 3. IEEE, 314–317.
[12] Hammad Squalli-Houssaini, Ngoc Q. K. Duong, Marquant Gwenaëlle, and Claire-Hélène Demarty. 2018. Deep learning for predicting image memorability. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2371–2375.
[13] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1701–1708.
[14] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[15] Patricia Valdez and Albert Mehrabian. 1994. Effects of color on emotions. Journal of Experimental Psychology: General 123, 394.