=Paper=
{{Paper
|id=Vol-3181/paper53
|storemode=property
|title=Exploring Multimodality, Perplexity and Explainability for Memorability
Prediction
|pdfUrl=https://ceur-ws.org/Vol-3181/paper53.pdf
|volume=Vol-3181
|authors=Alison Reboud,Ismail Harrando,Jorma Laaksonen,Raphaël Troncy
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ReboudHLT21
}}
==Exploring Multimodality, Perplexity and Explainability for Memorability Prediction==
Exploring Multimodality, Perplexity and Explainability for Memorability Prediction

Alison Reboud*, Ismail Harrando*, Jorma Laaksonen+ and Raphaël Troncy*
* EURECOM, Sophia Antipolis, France
+ Aalto University, Espoo, Finland
{alison.reboud,ismail.harrando,raphael.troncy}@eurecom.fr, jorma.laaksonen@aalto.fi

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’21, December 13-15 2021, Online.

ABSTRACT

This paper describes several approaches proposed by the MeMAD team for the MediaEval 2021 “Predicting Media Memorability” task. Our best approach is based on early fusion of multimodal (visual and textual) features. We also designed one of our runs to be explainable in order to give new insights into the topic of audiovisual content memorability. Finally, one of our runs is an experiment in analysing the potential role played by text perplexity in video content memorability.

1 APPROACH

The description of the task as well as the metrics used for its evaluation are detailed in [8]. We have experimented in the past with approaches combining textual and visual features [12] as well as using visio-linguistic models [13] for predicting short- and long-term media memorability. This year, we have explored other methods including: i) performing early fusion of multimodal features, ii) attempting to explain whether some phrases could trigger memorability or not, and iii) estimating the perplexity of video descriptions. Our code to enable reproducibility of our approaches is available at https://github.com/MeMAD-project/media-memorability.

1.1 Early Fusion of Multimodal Features

Textual features. Our textual approach uses the video descriptions (or captions) provided by the task organizers. First, we concatenate the video descriptions to obtain one string for each video. Then, to get a textual representation of the video content, we experimented with the following methods:

• Computing TF-IDF, removing rare words (fewer than 4 occurrences) and stopwords, and accounting for frequent 2-grams.
• Averaging GloVe embeddings for all non-stopword words, using the pre-trained 300-dimensional version [11].
• Averaging BERT [4] token representations (keeping all the words in the descriptions, up to 250 words per sentence).
• Using Sentence-BERT [14] sentence representations, in particular the distilled version fine-tuned for the Semantic Textual Similarity (STS) Benchmark (https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens).
• Using again Sentence-BERT, with the model fine-tuned on the Yahoo Answers topics dataset, comprising questions and answers from Yahoo Answers classified into 10 topics.

For each representation, we experimented with multiple regression models and fine-tuned the hyper-parameters using a fixed 6-fold cross-validation on the training set. For our submission, we used the Sentence-BERT model fine-tuned on the Yahoo Answers topics dataset.
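As an illustration of this kind of textual representation, the short sketch below shows how sentence embeddings can be computed with the sentence-transformers library. The helper function and example caption are hypothetical, and the checkpoint name is the public STS-distilled model referenced above rather than the exact configuration of our submitted Yahoo-Answers-tuned model.

```python
# Minimal sketch (not our released code): encoding concatenated video captions
# with a pre-trained Sentence-BERT model to obtain fixed-size textual features.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

def video_text_features(captions_per_video):
    """captions_per_video: dict mapping video_id -> list of caption strings."""
    features = {}
    for video_id, captions in captions_per_video.items():
        text = " ".join(captions)                 # one string per video, as described above
        features[video_id] = model.encode(text)   # fixed-size sentence embedding
    return features

# Hypothetical usage
feats = video_text_features({"video_0001": ["a woman is cooking in a kitchen"]})
print(feats["video_0001"].shape)                  # e.g. (768,)
```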
Visual features. We extracted 2048-dimensional I3D [3] features to describe the visual content of the videos. The I3D features are extracted from the Mixed_5c layer of the readily-available model trained on the Kinetics-400 dataset [7]. The performance of these features is superior to that of the 400-dimensional classification outputs and of the C3D [15] features provided by the task organizers.

Audio features. We used 527-dimensional audio features that encode the occurrence probabilities of the 527 classes of the Google AudioSet Ontology [5] in each video clip. The model uses the readily-available VGGish feature extraction model [6].

Prediction model. In all our early fusion experiments, the respective features were concatenated to create multimodal input feature vectors. We used a feed-forward network with one hidden layer to predict the memorability score. We varied the number of units in the hidden layer and optimized it together with the number of training epochs. We used ReLU non-linearity and dropout between the layers, and a simple sigmoid output for the regression result. The experiments used the same 6-fold cross-validation on the training set. The best models typically consisted of 600 units in the hidden layer and needed 700 training epochs to produce the maximal Spearman correlation score. We also experimented with a weighted average to combine modalities, but early fusion turned out to be more successful.
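The sketch below illustrates this fusion-and-regression setup, assuming pre-extracted per-modality feature vectors. The visual and audio dimensions follow the text, while the dropout rate and the 768-dimensional textual input are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch of the early-fusion regression head described above.
# Visual (I3D: 2048) and audio (AudioSet: 527) dimensions follow the text;
# the textual dimension (768) and dropout rate are illustrative assumptions.
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    def __init__(self, in_dim=768 + 2048 + 527, hidden=600, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),          # memorability score in [0, 1]
        )

    def forward(self, text_feat, visual_feat, audio_feat):
        # Early fusion: concatenate the per-modality feature vectors.
        fused = torch.cat([text_feat, visual_feat, audio_feat], dim=-1)
        return self.net(fused).squeeze(-1)

# Example forward pass with random (hypothetical) features for a batch of 4 videos.
model = FusionRegressor()
scores = model(torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 527))
print(scores.shape)  # torch.Size([4])
```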
1.2 Exploring Explainability

We have experimented with different simple text-based models that offer the possibility to quantify the relation between the caption and the predicted memorability score in an explainable manner. We train the models for the specific sub-task and dataset, i.e. for the short-term memorability predictions, we train the models on the short-term memorability scores. We compare feeding simple linear models (regressors) with interpretable input features: bag of words, TF-IDF, and topic distributions produced by an LDA model [2] trained on the corpus of captions. Upon evaluating the performance of each model/input-feature pair in a cross-validation protocol, we obtain the best results using TF-IDF features with a Linear Support Vector Regression (LinearSVR, https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html). On the one hand, this model allows us to somewhat understand the correspondence between some input words and the final predicted score. For example, the top word for raw and normalized short-term memorability on both Memento10K and TRECVID is “woman”. On the other hand, the empirical performance on both subtasks falls significantly behind our other models, demonstrating both the non-linear and multimodal nature of memorability.
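A minimal sketch of this pipeline is shown below, assuming lists of captions and short-term scores. The vectorizer settings and the example data are hypothetical, and the coefficient inspection illustrates how the top words mentioned above can be read off the linear model.

```python
# Minimal sketch: TF-IDF features + LinearSVR, with coefficient inspection
# to see which words push the predicted memorability score up or down.
# Hyper-parameters (min_df, ngram_range) and the example data are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVR
from sklearn.pipeline import make_pipeline

captions = ["a woman is cooking in a kitchen", "a dog runs on the beach"]  # hypothetical
scores = [0.87, 0.74]                                                      # hypothetical

vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 2), stop_words="english")
regressor = LinearSVR()
model = make_pipeline(vectorizer, regressor)
model.fit(captions, scores)

# Rank vocabulary terms by the weight the linear model assigns to them.
terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(regressor.coef_)[::-1]
print(terms[order][:10])  # top terms associated with higher memorability
```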
1.3 Exploring Perplexity

It has been suggested that memorable content can be found in sparse areas of an attribute space [1]. For example, images whose convolutional neural network features are sparsely distributed have been found to be more memorable [9]. Additionally, we observe that the results obtained on the TRECVID dataset (made of short videos from Vine) are considerably worse than those obtained on the Memento10K dataset, which may be due to the fact that the TRECVID dataset is smaller but also much more diverse. One hypothesis is that popular vines break with expectations. Backing this hypothesis, we found that videos in the TRECVID dataset depicting a person eating a car, or a chicken coming out of an egg, have high memorability scores. Therefore, inspired by [10], which predicts the novelty of a caption, we wanted to test the hypothesis that the novelty of a caption influences its memorability. We compute the (pseudo-)perplexity of each video description using a pretrained RoBERTa-large model. The score for each caption is obtained by summing the log probabilities of each masked token in the caption, and the scores of the different captions of a video are aggregated with a max function, i.e. we select the caption with the highest perplexity for each video. All runs have identical scores for each dataset, as we do not use the training set at all in this method.
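The sketch below illustrates one way to compute such a masked-LM pseudo-perplexity with RoBERTa-large from the transformers library. The sign convention, the lack of batching and the per-video aggregation helper are simplifying assumptions for readability, not a verbatim copy of our run.

```python
# Minimal sketch of the masked-LM pseudo-perplexity described above.
# Each token is masked in turn and the log probability of the original token is
# accumulated; batching and length normalization are omitted for readability.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large")
model.eval()

def pseudo_log_likelihood(caption: str) -> float:
    ids = tokenizer(caption, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(ids) - 1):          # skip <s> and </s>
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[ids[i]].item()
    return total  # closer to 0 = less surprising caption

def video_score(captions):
    # Aggregate over a video's captions with max surprisal (negated log-likelihood).
    return max(-pseudo_log_likelihood(c) for c in captions)

print(video_score(["a chicken comes out of an egg"]))  # hypothetical caption
```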
2 RESULTS AND DISCUSSION

We have prepared 5 different runs following the task description, defined as follows:

• run1 = Explainable (Section 1.2)
• run2 = Early fusion of Textual+Visual+Audio features
• run3 = Early fusion of Textual+Visual features
• run4 = Perplexity-based (Section 1.3)
• run5 = Early fusion of Textual+Visual features trained on the combined (TRECVID + Memento10k) datasets

All models except run1 use exclusively short-term scores for predicting the long-term score.

We present in Tables 1 and 2 the final results obtained on the test sets of the TRECVID and Memento10k datasets, respectively; Tables 3 and 4 report the results of the generalisation subtask.

Table 1: Results on the TRECVID test set for Short Term Raw (STr), Short Term Normalized (STn) and Long Term (LT) memorability (Sp = Spearman, Pe = Pearson)

Method  SpSTr   PeSTr   SpSTn   PeSTn   SpLT    PeLT
run1    0.127   0.153   0.158   0.168   0.016   0.014
run2    0.216   0.212   0.221   0.209   0.060   0.090
run3    0.220   0.214   0.226   0.218   0.063   0.098
run4   -0.050   0.013  -0.052   0.018  -0.043   0.024
run5    0.196   0.215   0.211   0.222   0.062   0.059

Table 2: Results on the Memento10K test set for Short Term Raw (STr) and Short Term Normalized (STn) memorability

Method  SpSTr   PeSTr   SpSTn   PeSTn
run1    0.464   0.460   0.463   0.458
run2    0.658   0.674   0.657   0.674
run3    0.655   0.672   0.658   0.675
run4    0.073   0.064   0.077   0.069
run5    0.654   0.672   0.651   0.671

Table 3: Generalisation subtask: results on the TRECVID test set for Short Term Raw (STr), Short Term Normalized (STn) and Long Term (LT) memorability

Method  SpSTr   PeSTr   SpSTn   PeSTn   SpLT    PeLT
run1    0.076   0.099   0.068   0.091  -0.013   0.021
run2    0.140   0.165   0.146   0.170   0.045   0.042

Table 4: Generalisation subtask: results on the Memento10K test set for Short Term Raw (STr) and Short Term Normalized (STn) memorability

Method  SpSTr   PeSTr   SpSTn   PeSTn
run1    0.196   0.196   0.181   0.184
run2    0.310   0.313   0.320   0.316

We comment on the Spearman rank scores as this is the official evaluation metric. We observe that the early fusion method which uses short-term scores works best for both short- and long-term predictions. Adding the audio modality features did not improve the results. We can also observe that the results for long-term prediction are always worse than the ones for short-term prediction, and that the results for Memento10K are always better. Combining the datasets did not yield better results. This is not very surprising for the Memento10K results since it is the bigger dataset. However, the fact that augmenting the TRECVID dataset did not lead to significant improvement suggests that, beyond a size difference, there is a difference in nature between the datasets that leads to bad generalisation in terms of prediction. This is confirmed by the generalisation subtask, which yields significantly worse results for both Memento10K and TRECVID. Finally, the scores obtained with the perplexity run were by far the lowest, only reaching 0.073 for Memento10K, whereas our best run obtained 0.658. With this run, rather than obtaining the best results, we wanted to evaluate the potential of adding a caption perplexity measure. At this stage, these results do not suggest a strong relationship between perplexity and memorability.

REFERENCES

[1] Wilma A. Bainbridge. 2021. Shared memories driven by the intrinsic memorability of items. In Human Perception of Visual Information: Psychological and Computational Perspectives.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[3] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4724–4733.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). ACL, Minneapolis, Minnesota, USA, 4171–4186.
[5] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, Louisiana, USA, 776–780.
[6] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. 2017. CNN Architectures for Large-Scale Audio Classification. arXiv:cs.SD/1609.09430.
[7] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. arXiv:cs.CV/1705.06950.
[8] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F. Smeaton, and Lorin Sweeney. 2021. Overview of The MediaEval 2021 Predicting Media Memorability Task. In Multimedia Benchmark Workshop (MediaEval).
[9] Jiří Lukavský and Filip Děchtěrenko. 2017. Visual properties and memorising scenes: Effects of image-space sparseness and uniformity. Attention, Perception, & Psychophysics 79, 7 (2017), 2044–2054.
[10] Nianzu Ma, Alexander Politowicz, Sahisnu Mazumder, Jiahua Chen, Bing Liu, Eric Robertson, and Scott Grigsby. 2021. Semantic Novelty Detection in Natural Language Descriptions. In International Conference on Empirical Methods in Natural Language Processing (EMNLP). 866–882.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In International Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, Melbourne, Australia, 1532–1543.
[12] Alison Reboud, Ismail Harrando, Jorma Laaksonen, Danny Francis, Raphael Troncy, and Hector Laria Mantecon. 2019. Combining Textual and Visual Modeling for Predicting Media Memorability. In Multimedia Benchmark Workshop (MediaEval) (CEUR Workshop Proceedings), Vol. 2670.
[13] Alison Reboud, Ismail Harrando, Jorma Laaksonen, and Raphael Troncy. 2020. Predicting Media Memorability with Audio, Video, and Text Representations. In Multimedia Benchmark Workshop (MediaEval) (CEUR Workshop Proceedings), Vol. 2882.
[14] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In International Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, Hong Kong, China, 3982–3992.
[15] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In International Conference on Computer Vision (ICCV). IEEE, Santiago, Chile, 4489–4497.