Multimodal Approach to Predicting Media Memorability

Tanmayee Joshi, Sarath Sivaprasad, Savita Bhat, Niranjan Pedanekar
TCS Research, Pune, India
tanmayee.joshi@tcs.com, sarath.s7@tcs.com

Copyright held by the owner/author(s). MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
In this paper, we present a multimodal approach to modelling media memorability for the "Predicting Media Memorability" task at MediaEval 2018. Our approach uses video- and image-based features along with the provided textual descriptions to predict a probability-like memorability score for each of the seven-second audioless video clips. We use the same set of features for predicting both short-term and long-term media memorability.

1 INTRODUCTION
With the dramatic surge of visual media content on platforms like Instagram, Flickr and YouTube, it is imperative that new methods for curating, annotating and organizing this content be explored. To this effect, non-traditional metrics for tagging media content have been examined. Previous works have used metrics such as aesthetics [2], interestingness [5] and memorability [9] to annotate and rank images. The "MediaEval 2018: Predicting Media Memorability" task [1] focuses on predicting 'short-term' and 'long-term' memorability for videos.

An important aspect of human cognition is the ability to remember and recall photos and videos with a surprising amount of detail. Interestingly, not all content is stored and recalled equally well [8]. Previous attempts [5, 8] at predicting image and video memorability discuss factors affecting memorability. In the experiment described in [1], people were shown content not related to them personally, and a varying probability of detecting a repetition of a given video after a short or long delay was recorded. Deep learning models have given promising predictions of image memorability [9, 12]. We propose an ensemble of deep learning models that takes into account various properties that are correlated with memorability. We capture these aspects by deriving the respective features from text embeddings, frames and video.

2 APPROACH
In this section, we outline our multimodal approach to modelling media memorability using video, image and text features. The visual features are inspired by different properties of images such as saliency and aesthetics. We assumed that the memorability of a video is affected by the properties of the images comprising the video. We also hypothesized that captions provide additional cues for understanding the semantics of videos. Short-term memorability and long-term memorability were modelled independently using the same set of features explained in this section.

Figure 1: Model Architecture

2.1 Visual Features
We used image-level features based on color, saliency, aesthetics, memorability and the presence of human faces. Image features were calculated on frames 0, 56 and 112. Except for aesthetics, we computed the mean and standard deviation of these features across the three frames to obtain a feature representation for a video clip. C3D features were used to represent the spatiotemporal aspect of a video.

Color: Color and its distribution have a significant influence on human cognition [6, 15]. Color information was captured using a 3D HSV histogram feature [4] and colorfulness [7]. The statistics over these vectors provided a 128-dimensional vector per video.

Saliency: Previous works [3, 11] observed that saliency is relevant for predicting memorability. For every image, a saliency map was created using a pre-trained image saliency network (SalNet) [10]. We hypothesized that the intensity distribution of saliency inside the image, and its change across frames, contributes more towards memorability than the spatial spread and orientation of salient pixels. We created bins from the saliency maps based on pixel intensity, with histogram boundaries of variable lengths to accommodate the variance of the pixel distribution.

Aesthetics: Aesthetics and human judgements of memorability are highly correlated [5, 8]. We used the median value, across frames, of the aesthetic visual features provided with the dataset.

Face-based Feature: Using a state-of-the-art deep learning method [13], we computed the number of faces per keyframe. Using this information, the dataset was divided into two parts: clips with faces and clips without faces. Running a Mann-Whitney U test over the memorability scores of the two populations, we found that they are significantly different (p-value 1.06e-31); a sketch of this test is given at the end of this subsection.

Image Memorability: We hypothesized that the memorability of a video is affected by the memorability of the images comprising it. We used MemNet, proposed by [9], to obtain a memorability score per image. This score was used directly as a part of the ensemble.

C3D: We used the fc7 activations of the C3D network (provided with the dataset) as a feature vector to capture activity in the video; it captures spatiotemporal information [14].
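As an illustration of the face-based analysis above, the following is a minimal sketch of the split-and-test procedure. The input arrays and file names are hypothetical placeholders (per-clip face counts and short-term memorability scores are assumed to be precomputed); only the Mann-Whitney U test itself corresponds to the test reported above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical inputs, one entry per video clip:
#   face_counts[i]  - number of faces detected in the keyframes of clip i
#   memorability[i] - annotated short-term memorability score of clip i
face_counts = np.load("face_counts.npy")         # assumed precomputed
memorability = np.load("short_term_scores.npy")  # assumed precomputed

# Split the clips into two populations: with faces and without faces.
with_faces = memorability[face_counts > 0]
without_faces = memorability[face_counts == 0]

# Two-sided Mann-Whitney U test over the memorability of the two populations.
stat, p_value = mannwhitneyu(with_faces, without_faces, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.2e}")
```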
2.2 Text-based Features
To analyse the data with respect to captions, we divided the training data into 4 bins, each consisting of the captions corresponding to one of the 4 quartiles of the memorability annotations. We defined a metric, 'word relevance', inspired by the term frequency-inverse document frequency (Tf-Idf) statistic typically used in Information Retrieval. We define the word relevance of a word w_i in a bin j as WR_ij. Let the total number of bins be N, the total number of words in bin j be W_j, the frequency of word w_i in bin j be w_ij, the number of bins in which word w_i appears be b_{w_i}, and the frequency of word w_i in the other bins be w_{i\hat{j}}. Then

    WR_{ij} = \frac{w_{ij}}{W_j} \left( 1 + \log \frac{N}{b_{w_i}} \right) \frac{1}{w_{i\hat{j}}}    (1)

We created a wordlist of all unique words from the video captions. After stemming and lemmatizing, we removed all stopwords from the list. Words with a WR value above a threshold were shortlisted as candidate words, and their frequency in the captions was used as a feature; a sketch of this computation is given at the end of this subsection. We believe that a higher value of WR quantifies the word's association with a particular range of memorability. We hypothesized that WR increases with the relatively higher frequency of a word in a particular bin with respect to its frequency in the other bins. It was observed that words related to topics such as food and toddlers fall in the higher memorability range, while generic words related to topics such as landscape and scenery fall in the lower memorability range.

We also used pre-trained GloVe embeddings (https://nlp.stanford.edu/projects/glove/) of words to capture more information from the textual descriptions. We preprocessed the caption data by removing stopwords and created a 100-dimensional word-embedding vector for each word.
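The following is a minimal sketch of how the word-relevance score in Equation (1) could be computed, assuming the captions have already been stemmed, lemmatized, stopword-filtered and grouped into the four quartile bins. The function name, the bins argument and the division-by-zero guard for words unique to a single bin are assumptions for illustration, not part of the method description above.

```python
import math
from collections import Counter

def word_relevance(bins):
    """Compute WR_ij for every (word, bin) pair as in Equation (1).

    bins: list of token lists, one list per memorability-quartile bin.
    Returns a dict mapping (word, bin_index) -> WR score.
    """
    N = len(bins)                                   # total number of bins
    counts = [Counter(tokens) for tokens in bins]   # w_ij per bin
    totals = [sum(c.values()) for c in counts]      # W_j per bin
    bin_freq = Counter()                            # b_wi: number of bins containing w_i
    for c in counts:
        bin_freq.update(c.keys())

    wr = {}
    for j, c in enumerate(counts):
        for word, w_ij in c.items():
            # Frequency of the word in all other bins (w_ij-hat).
            w_other = sum(counts[k][word] for k in range(N) if k != j)
            # max(..., 1) guards words that appear in only one bin; this
            # edge case is not specified in the text above.
            wr[(word, j)] = (w_ij / totals[j]) \
                * (1 + math.log(N / bin_freq[word])) / max(w_other, 1)
    return wr
```

Words whose WR exceeds a chosen threshold would then be retained as candidate words and their caption frequencies used as features.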
3 EXPERIMENTS
We ranked videos by assigning a probability-like score to each video clip, treating the task as a regression problem. The annotations for short-term and long-term memorability were skewed towards higher values, with means of 0.86 (short-term) and 0.78 (long-term). All input features were normalized, and the ground truth was kept unwhitened so that the model captured the skewed output distribution. We divided the given dataset into train and validation sets in the ratio 3:1 such that the annotations in the two sets have similar distributions. We explored different combinations of features for predicting memorability.

Experiment 1: Low-level features, namely colorfulness, blur value [11] and the HSV histogram, were concatenated and an SVR was trained over this vector.

Experiment 2: We concatenated the face-based feature with the 3D HSV feature. The resultant 130-dimensional vector represents the color spread and facial information in an image. We passed this vector, along with the C3D, aesthetics and saliency features, through dense fully connected layers independently. We ensembled these models using their normalized correlation values on the validation set as coefficients for a weighted average.

Experiment 3: We used word embeddings to train different neural network architectures; CNN-LSTM and Bi-LSTM models gave the best correlation on the training and validation data. We ensembled models trained on image features, video features and text embeddings using a weighted average and SVR.

We used sigmoid activation in the last layer of all networks so as to restrict the output to the range 0 to 1. ReLU activation was used for all other layers. We also fine-tuned the models over the validation data before predicting on the test data. The final predictions submitted for evaluation were based on the models from Experiments 2 and 3.
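The weighted-average ensembling used in Experiments 2 and 3 can be sketched as follows: each model's predictions are combined with weights proportional to its correlation on the validation set. The function and variable names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def ensemble_predictions(model_preds, val_correlations):
    """Weighted-average ensemble of per-model predictions.

    model_preds: list of 1-D arrays of per-clip predictions, one per model.
    val_correlations: validation-set correlation of each model.
    Weights are the correlations normalized to sum to 1.
    """
    weights = np.array(val_correlations, dtype=float)
    weights = weights / weights.sum()
    preds = np.stack(model_preds, axis=0)   # shape: (n_models, n_clips)
    return np.average(preds, axis=0, weights=weights)

# Illustrative usage with three hypothetical component models:
# final = ensemble_predictions([p_color_face, p_c3d, p_text], [0.31, 0.28, 0.35])
```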
4 RESULTS AND ANALYSIS
In Experiment 1, the model performed poorly, with a near-zero Spearman's rank correlation over the validation set. This shows that low-level image features alone are not sufficient to model media memorability. As part of the challenge, we submitted results for five runs based on Experiments 2 and 3. Table 1 lists the correlations obtained by the two best performing models. As per the evaluation on an unseen test set, the best performing model gives Spearman's and Pearson's correlations for short-term memorability of 0.46 and 0.50 respectively; the correlations for long-term memorability are 0.23 and 0.25 respectively. In the run based on Experiment 2, we obtained Spearman's rank correlations of 0.39 and 0.17 for short-term and long-term memorability respectively. Our best submission, based on Experiment 3, gives an improvement of 7%.

Table 1: Correlation for the ensemble models

Ensemble Model      Short-Term Memorability      Long-Term Memorability
                    Pearson      Spearman        Pearson      Spearman
Weighted Average    0.50         0.46            0.25         0.22
SVR                 0.48         0.44            0.25         0.23

As mentioned earlier, the model from Experiment 3 used textual information along with visual features. The improvement shows that the words from the captions contribute to predicting memorability scores. We believe that additional textual features such as location cues and emotion cues may be useful for further improvements. The mean values of our predictions of short-term and long-term memorability over the validation data are 0.83 and 0.78 respectively. These values are close to the mean values of the respective annotations in the training data, which shows that our model succeeds in capturing the skewed distribution of the training data.

5 CONCLUSION AND FUTURE WORK
This paper presents the ensemble model by team AREA66 for predicting media memorability. We use visual features based on images and video along with textual features from the given captions. The results show that better performance is obtained by combining visual and textual features, whereas visual features alone give the lowest prediction correlations; in particular, experiments with only low-level image features give poor results. In future work, we plan to further explore the effect of textual information on video memorability. We also aim to explore more sophisticated methods for utilizing low-level image features to improve prediction performance.

REFERENCES
[1] Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proceedings of the MediaEval 2018 Workshop. 29-31 October 2018, Sophia Antipolis, France.
[2] Sagnik Dhar, Vicente Ordonez, and Tamara L. Berg. 2011. High level describable attributes for predicting aesthetics and interestingness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1657–1664.
[3] Rachit Dubey, Joshua Peterson, Aditya Khosla, Ming-Hsuan Yang, and Bernard Ghanem. 2015. What makes an object memorable? In Proceedings of the IEEE International Conference on Computer Vision. 1089–1097.
[4] Ankit Goyal, Naveen Kumar, Tanaya Guha, and Shrikanth S. Narayanan. 2016. A multimodal mixture-of-experts model for dynamic emotion prediction in movies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2822–2826.
[5] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision. 1633–1640.
[6] Alan Hanjalic. 2006. Extracting moods from pictures and sounds: Towards truly personalized TV. IEEE Signal Processing Magazine 23, 90–100.
[7] David Hasler and Sabine E. Suesstrunk. 2003. Measuring colorfulness in natural images. In Human Vision and Electronic Imaging VIII, Vol. 5007. International Society for Optics and Photonics, 87–96.
[8] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2014. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1469–1482.
[9] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and predicting image memorability at a large scale. In Proceedings of the IEEE International Conference on Computer Vision. 2390–2398.
[10] Junting Pan, Elisa Sayrol, Xavier Giro-i-Nieto, Kevin McGuinness, and Noel E. O'Connor. 2016. Shallow and deep convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 598–606.
[11] José Luis Pech-Pacheco, Gabriel Cristóbal, Jesús Chamorro-Martinez, and Joaquín Fernández-Valdivia. 2000. Diatom autofocusing in brightfield microscopy: a comparative study. In Proceedings of the 15th International Conference on Pattern Recognition, Vol. 3. IEEE, 314–317.
[12] Hammad Squalli-Houssaini, Ngoc Q. K. Duong, Marquant Gwenaëlle, and Claire-Hélène Demarty. 2018. Deep learning for predicting image memorability. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2371–2375.
[13] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1701–1708.
[14] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[15] Patricia Valdez and Albert Mehrabian. 1994. Effects of color on emotions. Journal of Experimental Psychology: General 123, 394.