THAU-UPM at MediaEval 2021: From Video Semantics To Memorability Using Pretrained Transformers

Ricardo Kleinlein, Cristina Luna-Jiménez, Fernando Fernández-Martínez
Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid, Avda. Complutense 30, 28040 Madrid, Spain
{ricardo.kleinlein,cristina.lunaj,fernando.fernandezm}@upm.es

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
This paper reports on our experience after participating in the MediaEval 2021: Predicting Media Memorability challenge. The memorability of a video is defined as the proportion of people that successfully remembered having watched it on a second viewing during a memory game. Given this setup, teams were requested to provide systems able to predict the degree of memorability of individual videos from two different datasets: TRECVid and Memento10k. Our proposal builds upon previous work in which we found that non-adapted features extracted from Transformer architectures can be closely tied to semantic differences between samples, which in turn point to the overall memorability degree within different semantic units, or topics. We feed these precomputed features to linear regressors, showing that competitive prediction rates can be achieved even without adapting the input representation.

1 INTRODUCTION
Scientific modelling of cognitive variables of human perception of multimedia productions eluded a mathematical formulation until the last decades, leaving it largely as a discipline within psychology [1]. Although such variables are usually perceived to depend mostly on the subjective appraisals of each individual, the analysis of group-level datasets points to the existence of patterns that most humans share, at least to some degree, when exposed to multimedia content. One such instance is the problem of media memorability.
The MediaEval workshop, and in particular the Predicting Media Memorability challenge, now in its 4th consecutive edition, provides reliable data that researchers can use to further understand media memorability. A detailed description of the challenge, as well as of the data sources used in this task, can be found in [9].

2 RELATED WORK
Since the seminal work of Isola et al. [7], researchers have investigated whether the prediction of media memorability depends primarily on low-level visual descriptors such as image colour, brightness, and hue, or whether high-level, data-driven representations (e.g., image composition, scene recognition, and image classification features) are better suited to the task.
Our hypothesis, supported by studies from both neuroscience and psychology, is that certain topics (particularly those related to people) are inherently better remembered than other themes such as nature, war-like scenes and open spaces [8, 11]. Moreover, it seems that a major principle in the creation of new memories comes from the fact that the brain deals with scene and object representations at the same level of abstraction [12]. This highlights the need for global descriptors of the media content if the goal is to predict its likelihood of being remembered.
In recent times, the Transformer family of models has been proposed as an alternative to other neural architectures, with promising results so far [4, 5, 14, 17]. This is largely due to the inner representation of input features these models are able to compute, which tends to show a high degree of robustness to previously unseen data. Because of their success, we use them as either text or image encoders, in order to transform text descriptions or video frames into meaningful, semantically-rich vector embeddings.

3 APPROACH
In a previous study [10], we found that a Transformer trained on a sentence similarity task yielded features closely aligned to the automatic detection of topics within the set of available video descriptions. We also observed that some semantic units such as human, baby, girl, or man showed a higher average memorability than other topics closer to nature views, open spaces or war-like contents. One of the pillars of our analysis relied on the fact that the model used to encode sentences into embeddings was not fine-tuned or adapted to our task. Therefore, here we extend our methodology to other pretrained Transformer architectures (all the models used here can be downloaded from https://huggingface.co).
Here we explore a wider range of models, covering systems able to deal not only with text inputs, but also with visual information. The main distinction between the different runs (shown in Table 1) lies in the model combinations used to encode the textual and visual features. These embeddings are then fed as input to linear regression models that constitute the only part of the pipeline specifically trained on the task of predicting media memorability. Every video is represented by a single embedding, computed as the mean value of that video's individual sentence- or frame-level embeddings.

Run  Dataset      SBERT  GPT-2  CLIP (text)  CLIP (visual)  ViT  BEiT  PCA dims.  Method
1    TRECVid        x      x        x                                      64     Bayes
     Memento10k     x      x        x                                     256     LR
2    TRECVid        x                                                      32     Bayes
     Memento10k     x                                                     512     Bayes
3    Both                                          x          x     x    2048     Bayes
4    Both           x      x        x              x          x     x    4096     LR
5    Both           x      x        x              x          x     x    4096     Bayes

Table 1: Overview of the runs submitted. Different runs solve the task using different sets of precomputed features. Within every dataset, the same solution is proposed regardless of whether the labels are raw or normalised.
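For illustration, this pooling step can be sketched as follows (a minimal Python example with placeholder names and toy dimensions, not the actual pipeline code):

    import numpy as np

    def video_embedding(unit_embeddings: np.ndarray) -> np.ndarray:
        # Collapse sentence- or frame-level embeddings of shape (n_units, dim)
        # into a single video-level vector by mean pooling.
        return unit_embeddings.mean(axis=0)

    # Toy example: 12 frame embeddings of dimension 768 -> one 768-d video vector.
    frame_embeddings = np.random.rand(12, 768)
    assert video_embedding(frame_embeddings).shape == (768,)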
3.1 Text Transformers
Language is a natural way to describe to others what we see, and hence through it we can encapsulate the semantics of a video in a succinct and readable way. We choose three different architectures, SBERT [16], GPT-2 [15], and CLIP [14], each covering a different aspect of language modelling. SBERT is a variation of the popular BERT language model [4]; the embeddings computed with SBERT are targeted at telling apart pairs of sentences with similar or dissimilar meaning, which is beneficial when looking for topics in texts. We use the all-mpnet-base-v2 implementation. GPT-2 set a remarkable milestone in automatic text generation [15], since it is able to synthesize texts that are coherent in structure, use of language and grammar; features extracted with this model build a general-purpose language representation. CLIP was announced as a model able to combine information from both visual and textual sources in order to perform image classification and image synthesis simultaneously [14]. Its text encoder is used separately from the rest of the model to encode the sentences describing each video, with emphasis on the content of the video.
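A minimal sketch of how such sentence embeddings can be extracted with the Hugging Face libraries is given below. Apart from all-mpnet-base-v2, the checkpoint names (gpt2, openai/clip-vit-base-patch32) are plausible defaults rather than a statement of the exact checkpoints used in our runs, and taking the mean of the GPT-2 hidden states is likewise an assumption made only for illustration:

    import torch
    from sentence_transformers import SentenceTransformer
    from transformers import GPT2Model, GPT2Tokenizer, CLIPModel, CLIPProcessor

    captions = ["two people walking on a beach"]

    # SBERT: sentence embeddings tuned for semantic similarity.
    sbert_emb = SentenceTransformer("all-mpnet-base-v2").encode(captions)  # (1, 768)

    # GPT-2: mean of the last hidden states as a general-purpose representation.
    tok = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2Model.from_pretrained("gpt2")
    with torch.no_grad():
        gpt2_emb = gpt2(**tok(captions, return_tensors="pt")).last_hidden_state.mean(dim=1)

    # CLIP text encoder: caption embeddings grounded in visual content.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    with torch.no_grad():
        clip_emb = clip.get_text_features(**proc(text=captions, return_tensors="pt", padding=True))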
3.2 Visual Transformers
Although text descriptions can convey most of the semantic units within a video clip, many aspects of the clip itself are missed. For instance, a text such as "two people walking" can evoke an unending number of different images. However, extracting the semantics directly from images is a process far more complex to interpret and analyze. Fortunately, Transformers have also been applied to computer vision tasks. Hence, we can proceed analogously and build on the embedding representations extracted from video frames (sampled at 1 FPS) using pretrained models. We use the visual branch of a CLIP model, plus two additional systems designed under the same guiding principles as the original BERT. In particular, we use BEiT [2] and ViT [18] as additional visual encoders. Both were trained on image classification over the ImageNet-21k dataset [3] at a resolution of 224x224 pixels, though following different approaches.
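The frame-side extraction can be sketched in the same spirit. The checkpoint identifiers and the use of the [CLS] token are assumptions made for illustration, and the frame files are placeholders for frames previously sampled at 1 FPS (e.g. with ffmpeg):

    import torch
    from PIL import Image
    from transformers import (CLIPModel, CLIPProcessor,
                              ViTImageProcessor, ViTModel,
                              BeitImageProcessor, BeitModel)

    # Frames previously sampled at 1 FPS and stored as images (placeholder paths).
    frames = [Image.open(p).convert("RGB") for p in ["f000.jpg", "f001.jpg"]]

    # Visual branch of CLIP.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    cproc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    with torch.no_grad():
        clip_feat = clip.get_image_features(**cproc(images=frames, return_tensors="pt"))

    # ViT and BEiT, pretrained on ImageNet-21k at 224x224; here we keep the [CLS] token.
    vproc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    bproc = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
    beit = BeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
    with torch.no_grad():
        vit_feat = vit(**vproc(images=frames, return_tensors="pt")).last_hidden_state[:, 0]
        beit_feat = beit(**bproc(images=frames, return_tensors="pt")).last_hidden_state[:, 0]

Frame-level features obtained this way are then mean-pooled into one vector per video, as described in Section 3.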
3.3 Predictive models
We limit our setup to simple linear predictors: linear regression and Naïve Bayes regression. Both are conceptually simple, yet different in their inner workings (we used the default implementations from the sklearn library [13]), which allows us to concentrate on the predictive power of the non-adapted input features. In addition, Principal Component Analysis (PCA) is used to project the input vectors onto spaces of lower dimensionality [6]. Each run was submitted with the learning method and PCA dimensionality that performed best over the development set of each dataset.
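To make this final stage concrete, the following self-contained sketch mirrors a PCA-plus-regressor grid search and its Spearman evaluation. The data are random placeholders, only a subset of the PCA sizes of Table 1 is searched, and BayesianRidge merely stands in for the Bayes regressor, since the exact scikit-learn estimator is not detailed above:

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.decomposition import PCA
    from sklearn.linear_model import BayesianRidge, LinearRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder data standing in for video-level embeddings and memorability scores.
    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((1500, 2304)), rng.random(1500)
    X_dev, y_dev = rng.random((500, 2304)), rng.random(500)

    for n_dims in (64, 256, 512):
        for regressor in (LinearRegression(), BayesianRidge()):
            model = make_pipeline(PCA(n_components=n_dims), regressor)
            model.fit(X_train, y_train)
            rho, _ = spearmanr(y_dev, model.predict(X_dev))
            print(f"PCA={n_dims:4d}  {type(regressor).__name__:16s}  Spearman={rho:+.3f}")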
Dataset      Labels       run 1   run 2   run 3   run 4   run 5
TRECVid      short-raw    0.204   0.265   0.291   0.205   0.198
             short-norm   0.193   0.272   0.293   0.193   0.198
             long         0.125   0.102   0.077   0.009   0.01
Memento10k   raw          0.596   0.601   0.656   0.651   0.651
             norm         0.598   0.606   0.657   0.652   0.651

Table 2: Spearman's rank correlation coefficient of our proposed models when evaluated over the official test set. The best-performing run on each set of labels is run 3, except for the TRECVid long-term labels, where run 1 performs best.

4 RESULTS AND OUTLINE
Table 2 shows the Spearman's rank correlation coefficient values over the test set for each run submitted. First, it is noticeable that the combination of all visual embeddings outperforms every other approach, except on the long-term set of labels. That exception points to the possibility that text-based representations may better encode the semantics needed to predict long-term media memorability.
Another interesting observation concerns the relative difference in performance shown by the same approaches depending on which dataset is considered. The worse predictions obtained over TRECVid data are likely related to its smaller size, as well as to the fact that most videos within TRECVid fall into a narrow range of the topics detected in Memento10k.

ACKNOWLEDGMENTS
The work leading to these results was supported by the Spanish Ministry of Science and Innovation through the GOMINOLA (PID2020-118112RB-C21 and PID2020-118112RB-C22, funded by MCIN/AEI/10.13039/501100011033), CAVIAR (TEC2017-84593-C2-1-R, funded by MCIN/AEI/10.13039/501100011033/FEDER "Una manera de hacer Europa"), and AMIC (TIN2017-85854-C4-4-R, funded by MCIN/AEI/10.13039/501100011033/FEDER "Una manera de hacer Europa") projects. This research also received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement Nº 823907 (http://menhir-project.eu, accessed on 17 November 2021). Furthermore, R.K.'s research was supported by the Spanish Ministry of Education (FPI grant PRE2018-083225).

REFERENCES
[1] Rudolf Arnheim. 1954. Art and visual perception: a psychology of the creative eye. University of California Press.
[2] Hangbo Bao, Li Dong, and Furu Wei. 2021. BEiT: BERT Pre-Training of Image Transformers. arXiv:2106.08254. https://arxiv.org/abs/2106.08254
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 (2021).
[6] Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (1901), 559–572. https://doi.org/10.1080/14786440109462720
[7] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2014. What Makes a Photograph Memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1469–1482. https://doi.org/10.1109/TPAMI.2013.200
[8] Andrew Jaegle, Vahid Mehrpour, Yalda Mohsenzadeh, Travis Meyer, Aude Oliva, and Nicole Rust. 2019. Population response magnitude variation in inferotemporal cortex predicts image memorability. eLife 8 (Aug 2019), e47596. https://doi.org/10.7554/eLife.47596
[9] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F. Smeaton, and Lorin Sweeney. 2021. Overview of The MediaEval 2021 Predicting Media Memorability Task. In Working Notes Proceedings of the MediaEval 2021 Workshop.
[10] Ricardo Kleinlein, Cristina Luna-Jiménez, David Arias-Cuadrado, Javier Ferreiros, and Fernando Fernández-Martínez. 2021. Topic-Oriented Text Features Can Match Visual Deep Models of Video Memorability. Applied Sciences 11, 16 (2021). https://doi.org/10.3390/app11167406
[11] T. Konkle, T. F. Brady, G. A. Alvarez, and A. Oliva. 2010. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. Journal of Experimental Psychology: General 139, 3 (2010), 558–578.
[12] Talia Konkle, Timothy F. Brady, George A. Alvarez, and Aude Oliva. 2010. Scene Memory Is More Detailed Than You Think: The Role of Categories in Visual Long-Term Memory. Psychological Science 21, 11 (2010), 1551–1556. https://doi.org/10.1177/0956797610385359 PMID: 20921574.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[14] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML.
[15] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).
[16] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[18] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual Transformers: Token-based Image Representation and Processing for Computer Vision. (2020). arXiv:2006.03677