Exploring Multimodality, Perplexity and Explainability for Memorability Prediction

Alison Reboud*, Ismail Harrando*, Jorma Laaksonen+ and Raphaël Troncy*
* EURECOM, Sophia Antipolis, France
+ Aalto University, Espoo, Finland
{alison.reboud,ismail.harrando,raphael.troncy}@eurecom.fr
jorma.laaksonen@aalto.fi
ABSTRACT
This paper describes several approaches proposed by the MeMAD team for the MediaEval 2021 “Predicting Media Memorability” task. Our best approach is based on the early fusion of multimodal (visual and textual) features. We also designed one of our runs to be explainable in order to give new insights into the topic of audiovisual content memorability. Finally, one of our runs is an experiment in analysing the potential role played by text perplexity in video content memorability.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

1     APPROACH
The description of the task, as well as the metrics used for its evaluation, is detailed in [8]. We have experimented in the past with approaches combining textual and visual features [12] as well as with visio-linguistic models [13] for predicting short- and long-term media memorability. This year, we have explored other methods, including: i) performing early fusion of multimodal features, ii) attempting to explain whether some phrases could trigger memorability or not, and iii) estimating the perplexity of video descriptions. Our code, enabling the reproducibility of our approaches, is available at https://github.com/MeMAD-project/media-memorability.

1.1     Early Fusion of Multimodal Features
   Textual features. Our textual approach uses the video descriptions (or captions) provided by the task organizers. First, we concatenate the video descriptions to obtain one string for each video. Then, to get a textual representation of the video content, we experimented with the following methods:
• Computing TF-IDF vectors, removing rare words (fewer than 4 occurrences) and stopwords, and accounting for frequent 2-grams.
• Averaging GloVe embeddings over all non-stopword words, using the pre-trained 300-dimensional version [11].
• Averaging BERT [4] token representations (keeping all the words in the descriptions, up to 250 words per sentence).
• Using Sentence-BERT [14] sentence representations, in particular the distilled version fine-tuned for the STS Textual Similarity Benchmark1.
• Using again Sentence-BERT, with the model fine-tuned on the Yahoo Answers topics dataset, which comprises questions and answers from Yahoo Answers classified into 10 topics.
For each representation, we experimented with multiple regression models and fine-tuned the hyper-parameters using a fixed 6-fold cross-validation on the training set. For our submission, we used the Sentence-BERT model fine-tuned on the Yahoo Answers topics dataset.

1 https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens
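As a minimal illustration of this pipeline (a sketch, not the exact script in our repository), the snippet below encodes per-video caption strings with Sentence-BERT and evaluates standard regressors with a fixed 6-fold cross-validation using the Spearman correlation; the in-memory example data and the choice of Ridge and SVR as regressors are illustrative assumptions.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Illustrative stand-ins: one concatenated caption string and one
# memorability score per video (the real data come from the task datasets).
captions = ["a woman dances in a kitchen", "a dog runs across a field",
            "a man eats a car on the street", "a chicken comes out of an egg",
            "people walk in a park", "a child plays with a ball"] * 10
scores = [0.91, 0.82, 0.95, 0.93, 0.78, 0.85] * 10

# Sentence-BERT encoder (here the STS-distilled checkpoint; the submitted run
# used a checkpoint fine-tuned on the Yahoo Answers topics dataset instead).
encoder = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
X = encoder.encode(captions)

# Spearman rank correlation is the official metric of the task.
spearman = make_scorer(lambda y_true, y_pred: spearmanr(y_true, y_pred).correlation)
cv = KFold(n_splits=6, shuffle=True, random_state=0)

for name, reg in [("ridge", Ridge(alpha=1.0)), ("svr", SVR())]:
    fold_scores = cross_val_score(reg, X, scores, cv=cv, scoring=spearman)
    print(f"{name}: mean Spearman = {fold_scores.mean():.3f}")
```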
   Visual features. We extracted 2048-dimensional I3D [3] features to describe the visual content of the videos. The I3D features are extracted from the Mixed_5c layer of the readily available model trained on the Kinetics-400 dataset [7]. The performance of these features is superior to that of the 400-dimensional classification output and of the C3D [15] features provided by the task organizers.

   Audio features. We used 527-dimensional audio features that encode the occurrence probabilities of the 527 classes of the Google AudioSet Ontology [5] in each video clip. The model uses the readily available VGGish feature extraction model [6].

   Prediction model. In all our early fusion experiments, the respective features were concatenated to create multimodal input feature vectors. We used a feed-forward network with one hidden layer to predict the memorability score. We varied the number of units in the hidden layer and optimized it together with the number of training epochs. We used ReLU non-linearity and dropout between the layers, and a simple sigmoid output for the regression result. The experiments used the same 6-fold cross-validation on the training set. The best models typically consisted of 600 units in the hidden layer and needed 700 training epochs to produce the maximal Spearman correlation score. We also experimented with a weighted average to combine modalities, but early fusion turned out to be more successful.
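For concreteness, the following PyTorch sketch shows one possible realization of this early-fusion regressor; the framework, the dropout rate and the 768-dimensional textual feature size are illustrative assumptions, while the single 600-unit hidden layer, the ReLU, the dropout and the sigmoid output follow the description above.

```python
import torch
import torch.nn as nn

class EarlyFusionRegressor(nn.Module):
    """Early fusion: modality features are concatenated into one input vector,
    passed through a single hidden layer with ReLU and dropout, and a sigmoid
    output produces a memorability score in [0, 1]."""

    def __init__(self, input_dim: int, hidden_dim: int = 600, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Illustrative batch: 768-d textual, 2048-d I3D visual, 527-d AudioSet features.
text_f, visual_f, audio_f = torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 527)
fused = torch.cat([text_f, visual_f, audio_f], dim=1)   # early fusion by concatenation
model = EarlyFusionRegressor(input_dim=fused.shape[1])
pred = model(fused)  # predicted memorability scores, shape (4,)
```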


1.2    Exploring Explainability
We have experimented with different simple text-based models that offer the possibility to quantify the relation between the caption and the predicted memorability score in an explainable manner. We train the models for each specific sub-task and dataset, i.e. for the short-term memorability predictions, we train the models on the short-term memorability scores.
   We compare feeding simple linear models (regressors) with interpretable input features: bag of words, TF-IDF, and topic distributions produced by an LDA model [2] trained on the corpus made of captions. Upon evaluating the performance of each model/input feature pair in a cross-fold validation protocol, we obtain the best results using TF-IDF features with a Linear Support Vector Regression (LinearSVR2). On the one hand, this model allows us to somewhat understand the correspondence between some input words and the final predicted score. For example, the top word for raw and normalized short-term memorability on both Memento10K and TRECVID is “woman”. On the other hand, the empirical performance on both subtasks falls significantly behind the other models, demonstrating both the non-linear and multimodal nature of memorability.

2 https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html
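To make the explainable setup concrete, the sketch below fits a LinearSVR on TF-IDF features and reads the linear coefficients as per-term contributions to the predicted memorability; the tiny in-memory captions, scores and vectorizer settings are illustrative stand-ins, not our exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVR

# Illustrative stand-ins for the per-video caption strings and the
# corresponding short-term memorability scores of the training set.
captions = ["a woman is dancing in a kitchen", "a man eats a car on the street",
            "a chicken comes out of an egg", "a dog runs across a field"]
scores = [0.91, 0.95, 0.93, 0.82]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(captions)
reg = LinearSVR(C=1.0, max_iter=10_000).fit(X, scores)

# Because the model is linear, each coefficient directly tells us whether the
# corresponding term pushes the predicted memorability up or down.
terms = vectorizer.get_feature_names_out()
ranked = sorted(zip(reg.coef_, terms), reverse=True)
print("terms pushing memorability up:  ", [t for _, t in ranked[:5]])
print("terms pushing memorability down:", [t for _, t in ranked[-5:]])
```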

1.3     Exploring Perplexity
It has been suggested that memorable content can be found in sparse areas of an attribute space [1]. For example, images whose convolutional neural network features are sparsely distributed have been found to be more memorable [9]. Additionally, we observe that the results obtained on the TRECVID dataset (made of short videos from Vine) are considerably worse than those obtained on the Memento10K dataset, which may be due to the fact that the TRECVID dataset is smaller but also much more diverse. One hypothesis is that popular Vines break with expectations. Backing this hypothesis, we found that videos in the TRECVID dataset depicting a person eating a car or a chicken coming out of an egg have high memorability scores. Therefore, inspired by [10], who predict the novelty of a caption, we wanted to test the hypothesis that the novelty of a caption influences its memorability.
   We explore the (pseudo-)perplexity of each video description using a pretrained RoBERTa-large model. The score for each caption is computed by adding up the log probabilities of each masked token in the caption, and the aggregation between captions is done with a max function: we select the caption with the highest perplexity for each video. All runs have identical scores for each dataset, as we do not use the training set at all in this method.
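The sketch below shows one way to compute such a score with Hugging Face Transformers: every token of a caption is masked in turn and the log probabilities of the original tokens are summed (a pseudo-log-likelihood, which decreases as perplexity grows); the example captions and the exact sign and aggregation conventions are illustrative and may differ from our implementation.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log probabilities of each token when it is masked in turn;
    lower (more negative) values correspond to higher pseudo-perplexity."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(ids) - 1):          # skip <s> and </s>
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Illustrative captions for one video: keep the most "surprising" one,
# i.e. the caption with the lowest pseudo-log-likelihood (highest perplexity).
captions = ["a chicken comes out of an egg", "a man is eating a car"]
video_score = min(pseudo_log_likelihood(c) for c in captions)
print(video_score)
```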
2     RESULTS AND DISCUSSION
We have prepared 5 different runs following the task description, defined as follows:
• run1 = Explainable (Section 1.2)
• run2 = Early fusion of Textual+Visual+Audio features
• run3 = Early fusion of Textual+Visual features
• run4 = Perplexity-based (Section 1.3)
• run5 = Early fusion of Textual+Visual features, trained on the combined (TRECVID + Memento10K) datasets
All models except run1 use exclusively short-term scores for predicting the long-term score.
   We present in Tables 1 and 2 the final results obtained on the test sets of the TRECVID and Memento10K datasets, respectively. We comment on the Spearman rank scores, as this is the official evaluation metric. We observe that the early fusion method which uses short-term scores works best for both short- and long-term predictions. Adding the audio modality features did not improve the results. We can also observe that the results for long-term prediction are always worse than the ones for short-term prediction, and that the results for Memento10K are always better. Combining the datasets did not yield better results. This is not very surprising for the Memento10K results, since it is the bigger dataset. However, the fact that augmenting the TRECVID dataset did not lead to a significant improvement suggests that, beyond a size difference, there is a difference in nature between the datasets that leads to poor generalisation in terms of prediction. This is confirmed by the generalisation subtask, which yields significantly worse results for both Memento10K and TRECVID (Tables 3 and 4). Finally, the scores obtained with the perplexity run were by far the lowest, only reaching 0.073 for Memento10K, whereas our best run obtained 0.658. With this run, rather than obtaining the best results, we wanted to evaluate the potential of adding a caption perplexity measure. At this stage, these results do not suggest a strong relationship between perplexity and memorability.

Table 1: Results on the TRECVID test set for Short Term Raw (STr), Short Term Normalized (STn) and Long Term (LT) memorability (Sp = Spearman, Pe = Pearson)

  Method     SpSTr     PeSTr     SpSTn     PeSTn     SpLT      PeLT
  run1       0.127     0.153     0.158     0.168     0.016     0.014
  run2       0.216     0.212     0.221     0.209     0.060     0.090
  run3       0.220     0.214     0.226     0.218     0.063     0.098
  run4      -0.050     0.013    -0.052     0.018    -0.043     0.024
  run5       0.196     0.215     0.211     0.222     0.062     0.059

Table 2: Results on the Memento10K test set for Short Term Raw (STr) and Short Term Normalized (STn) memorability

  Method     SpSTr     PeSTr     SpSTn     PeSTn
  run1       0.464     0.460     0.463     0.458
  run2       0.658     0.674     0.657     0.674
  run3       0.655     0.672     0.658     0.675
  run4       0.073     0.064     0.077     0.069
  run5       0.654     0.672     0.651     0.671

Table 3: Generalisation subtask: results on the TRECVID test set for Short Term Raw (STr), Short Term Normalized (STn) and Long Term (LT) memorability

  Method     SpSTr     PeSTr     SpSTn     PeSTn     SpLT      PeLT
  run1       0.076     0.099     0.068     0.091    -0.013     0.021
  run2       0.140     0.165     0.146     0.170     0.045     0.042

Table 4: Generalisation subtask: results on the Memento10K test set for Short Term Raw (STr) and Short Term Normalized (STn) memorability

  Method     SpSTr     PeSTr     SpSTn     PeSTn
  run1       0.196     0.196     0.181     0.184
  run2       0.310     0.313     0.320     0.316


REFERENCES
 [1] Wilma A Bainbridge. 2021. Shared memories driven by the intrin-
     sic memorability of items. Human Perception of Visual Information:
     Psychological and Computational Perspectives (2021).
 [2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent
     Dirichlet Allocation. Journal of Machine Learning Research 3 (2003),
     993–1022.
 [3] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recog-
     nition? A New Model and the Kinetics Dataset. In IEEE Conference on
     Computer Vision and Pattern Recognition (CVPR). IEEE, 4724–4733.
 [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
     2019. BERT: Pre-training of Deep Bidirectional Transformers for
     Language Understanding. In Conference of the North American Chap-
     ter of the Association for Computational Linguistics (NAACL). ACL,
     Minneapolis, Minnesota, USA, 4171–4186.
 [5] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade
     Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017.
     Audio set: An ontology and human-labeled dataset for audio events. In
     IEEE International Conference on Acoustics, Speech and Signal Processing
     (ICASSP). New Orleans, Louisiana, USA, 776–780.
 [6] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gem-
     meke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt,
     Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and
     Kevin Wilson. 2017. CNN Architectures for Large-Scale Audio Classi-
     fication. (2017). arXiv:cs.SD/1609.09430
 [7] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier,
     Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back,
     Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The
     Kinetics Human Action Video Dataset. (2017). arXiv:cs.CV/1705.06950
 [8] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène
     Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera,
     Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F.
     Smeaton, and Lorin Sweeney. 2021. Overview of The MediaEval
     2021 Predicting Media Memorability Task. In Multimedia Benchmark
     Workshop (MediaEval).
 [9] Jiří Lukavský and Filip Děchtěrenko. 2017. Visual properties and
     memorising scenes: Effects of image-space sparseness and uniformity.
     Attention, Perception, & Psychophysics 79, 7 (2017), 2044–2054.
[10] Nianzu Ma, Alexander Politowicz, Sahisnu Mazumder, Jiahua Chen,
     Bing Liu, Eric Robertson, and Scott Grigsby. 2021. Semantic Novelty
     Detection in Natural Language Descriptions. In International Confer-
     ence on Empirical Methods in Natural Language Processing (EMNLP).
     866–882.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014.
     Glove: Global vectors for word representation. In International Con-
     ference on Empirical Methods in Natural Language Processing (EMNLP).
     ACL, Melbourne, Australia, 1532–1543.
[12] Alison Reboud, Ismail Harrando, Jorma Laaksonen, Danny Francis,
     Raphael Troncy, and Hector Laria Mantecon. 2019. Combining Textual
     and Visual Modeling for Predicting Media Memorability. In Multime-
     dia Benchmark Workshop (MediaEval) (CEUR Workshop Proceedings),
     Vol. 2670.
[13] Alison Reboud, Ismail Harrando, Jorma Laaksonen, and Raphael
     Troncy. 2020. Predicting Media Memorability with Audio, Video, and
     Text representations. In Multimedia Benchmark Workshop (MediaEval)
     (CEUR Workshop Proceedings), Vol. 2882.
[14] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence
     Embeddings using Siamese BERT-Networks. In International Confer-
     ence on Empirical Methods in Natural Language Processing (EMNLP).
     ACL, Hong Kong, China, 3982–3992.
[15] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and
     Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D
     Convolutional Networks. In International Conference on Computer
     Vision (ICCV). IEEE, Santiago, Chile, 4489–4497.