Exploring Video Transformers and Automatic Segment Selection for Memorability Prediction

Iván Martín-Fernández1,*, Sergio Esteban-Romero1, Jaime Bellver-Soler1, Manuel Gil-Martín1 and Fernando Fernández-Martínez1

1 Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid (UPM)

MediaEval'23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
* Corresponding author.
Emails: ivan.martinf@upm.es (I. Martín-Fernández); sergio.estebanro@upm.es (S. Esteban-Romero); jaime.bellver@upm.es (J. Bellver-Soler); manuel.gilmartin@upm.es (M. Gil-Martín); fernando.fernandezm@upm.es (F. Fernández-Martínez)
ORCID: 0009-0004-2769-9752 (I. Martín-Fernández); 0009-0008-6336-7877 (S. Esteban-Romero); 0009-0006-7973-4913 (J. Bellver-Soler); 0000-0002-4285-6224 (M. Gil-Martín); 0000-0003-3877-0089 (F. Fernández-Martínez)

Abstract
This paper summarises THAU-UPM's approach and results for the MediaEval 2023 Predicting Video Memorability task. Focused on the generalisation subtask, our work leverages a pre-trained Video Vision Transformer (ViViT), fine-tuned on memorability-related data, to model temporal and spatial relationships in videos. We propose novel, annotator-independent automatic segment selection methods grounded in visual saliency. These methods identify the most relevant video frames prior to memorability score estimation, and the selection process is applied during both the training and evaluation phases. Our study demonstrates the effectiveness of fine-tuning the ViViT model compared to a scratch-trained baseline, emphasising the importance of pre-training for predicting memorability. However, the model shows comparable sensitivity to both saliency-based and naive segment selection methods, suggesting that fine-tuning may harness similar benefits from various video segments. These results underscore the robustness of our approach but also signal the need for ongoing research.

1. Introduction and Motivating Work

Memorability is an aspect of human perception that has attracted the interest of researchers in psychology, neuroscience and computer science alike, due to its relevance to areas as diverse as disease diagnosis, marketing and education. Taking advantage of the burgeoning advances in artificial intelligence architectures for media retrieval, classification and analysis as a proxy for modelling the connections between human senses and our understanding of the world through cognitive processes is particularly appealing, which explains the steady stream of work on the subject in recent years. The MediaEval Predicting Video Memorability task, currently in its sixth edition [1], plays an important role in this effort. This contribution addresses the generalisation subtask, aimed at training systems that learn general knowledge about the task and can be tested on different datasets. To the best of our knowledge, most recent approaches to the Predicting Video Memorability task rely on image-level architectures to extract knowledge from a handful of frames, followed by some fusion strategy to obtain a single representation for the entire video, using powerful image-only backbone models such as the Vision Transformer while neglecting architectures that take video itself as input [2, 3, 4]. A notable exception comes from Constantin
and Ionescu [5], who train a Video Vision Transformer (ViViT) [6] to predict memorability from video segments, thus integrating the temporal aspect of videos into the core architecture of the system. They also present a technique for selecting which video segments are used to train and evaluate the model, based on the time it took annotators to recall watching a video. Although the authors prove the effectiveness of this method, we aim to develop an alternative that is based purely on input data and can therefore be used in the absence of this time-specific annotation. Furthermore, our strategy can be used for both training and evaluation, which we argue is an advantage over annotation-based approaches, where an arbitrary segment selection method has to be designed for the testing phase in order to avoid data leakage. In the spirit of transfer learning and generalisation, we propose to fine-tune a Video Vision Transformer, pre-trained on a generic video classification task, on memorability-related data. In addition, we evaluate different strategies for selecting which video segments are fed into the model during both training and evaluation.

2. Approach

We hypothesise that the ViViT architecture has the potential to be a robust, data-agnostic model for memorability prediction, and therefore to perform well in the generalisation task scenario. With this in mind, our approach is based on incorporating generic knowledge into the training process using two complementary strategies: a) fine-tuning a pre-trained ViViT model instead of training from scratch, and b) proposing automatic segment selection methods that do not rely on annotator data.

2.1. Fine-tuning Video Transformers

The ViViT model is an adaptation of the original Vision Transformer that can model the temporal relationships between frames as well as the spatial relationships within each image, by including a three-dimensional Tubelet Embedding encoder before the Transformer input. We start our training from the official ViViT checkpoint available on Hugging Face1. Its training data, Kinetics 400 [7], consists of 10-second clips extracted from YouTube videos, each depicting one of 400 possible human actions, with a minimum of 400 clips per action class. We believe that modelling the subtleties of human-centred imagery with this vast amount of content is key to understanding media memorability, as there is a direct relationship between the two concepts [8]. Our regression head consists of a linear layer followed by a sigmoid activation function, appended to the last hidden state of the final encoder. This design operates under the hypothesis that this representation is inherently meaningful and requires no further transformations. We train on a single 32-frame segment extracted from each video in the training set, using one of the segment selection methods described next. The number of frames is imposed by the architecture of the model that we wish to fine-tune.
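To make this setup concrete, the sketch below shows one way of attaching such a head to the pre-trained checkpoint with the transformers library. It is a minimal illustration rather than our exact training code: the class name, the use of the [CLS] token as the clip representation and the unfreezing helper are assumptions made for the example.

```python
import torch.nn as nn
from transformers import VivitModel


class ViViTMemorabilityRegressor(nn.Module):
    """Pre-trained ViViT backbone with a linear + sigmoid regression head."""

    def __init__(self, checkpoint="google/vivit-b-16x2-kinetics400", n_unfrozen_encoders=1):
        super().__init__()
        self.backbone = VivitModel.from_pretrained(checkpoint)
        # Freeze the whole backbone, then re-enable gradients for the last
        # n encoder layers (those closest to the regression head).
        for param in self.backbone.parameters():
            param.requires_grad = False
        for layer in self.backbone.encoder.layer[-n_unfrozen_encoders:]:
            for param in layer.parameters():
                param.requires_grad = True
        # Linear layer + sigmoid, since memorability scores lie in [0, 1].
        self.head = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, 1),
            nn.Sigmoid(),
        )

    def forward(self, pixel_values):
        # pixel_values: (batch, 32 frames, 3, 224, 224), as expected by this checkpoint.
        hidden = self.backbone(pixel_values=pixel_values).last_hidden_state
        # Assumption: the [CLS] token of the final encoder summarises the clip.
        return self.head(hidden[:, 0]).squeeze(-1)
```

Increasing n_unfrozen_encoders from one to the full depth of the backbone mirrors the progressive unfreezing experiment discussed in Section 3.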
To provide a point of comparison for our fine-tuning proposal, we train a baseline ViViT model from scratch, using the implementation proposed in [5] (i.e., 15 frames per segment, 8 attention heads per Transformer encoder, and 8 encoders). This baseline is trained on every possible 15-frame segment that can be extracted from each video in the training set, so as to maximise the amount of information used for learning. We aim to test whether this simpler architecture can compensate for the lack of pre-training data with the ability to generate more meaningful representations of memorability-related videos.

1 https://huggingface.co/google/vivit-b-16x2-kinetics400

Figure 1: Saliency maps for a sample frame. The whitest pixels are those predicted to be most salient.

2.2. Designing an automatic segment selection method

Using [3] as a reference, we elaborate on the idea of selecting the most representative segment of a video and propose a novel method that is annotator-independent and selects the most relevant set of frames using only visual information, instead of relying on label-related data. Building on the existing conception that saliency, defined as the prominence of features within an image that naturally attract human attention, is closely related to memorability [9, 10, 11], we propose a method that automatically selects the most salient segment of a video and uses it as input. We compare two different methods for computing image saliency. The first one, based on [12] and denoted Fine Grained after the OpenCV implementation [13], analyses localised variations in the image to identify salient regions. The second, Spectral Residual [14], identifies areas that stand out in the spectral domain of an image. By comparing these approaches, we aim to determine whether the nuanced detail detection of the Fine Grained method or the global anomaly identification of the Spectral Residual approach is more effective at isolating memorable segments in videos. To identify the most representative video segment, we compute the total pixel saliency of each frame, sum the per-frame saliency within a sliding window of 𝑛 = 32 frames, and normalise these window scores. The window with the highest normalised saliency is then selected as input. To test our approach, we compare it against two image-agnostic baselines: Uniform Sampling of 𝑛 frames from the entire clip, and extraction of the 𝑛 frames of the Center Segment of the video.
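The following sketch illustrates how this selection could be implemented with the two OpenCV static saliency detectors. It assumes the opencv-contrib-python build (which provides the cv2.saliency module) and that the video has already been decoded into a list of BGR frames; the function names are ours, for illustration only.

```python
import cv2
import numpy as np


def frame_saliency_scores(frames, method="fine_grained"):
    """Total pixel saliency of each frame, using an OpenCV static saliency detector."""
    if method == "fine_grained":
        detector = cv2.saliency.StaticSaliencyFineGrained_create()
    else:
        detector = cv2.saliency.StaticSaliencySpectralResidual_create()
    scores = []
    for frame in frames:  # each frame is a BGR image of shape (H, W, 3)
        ok, saliency_map = detector.computeSaliency(frame)
        scores.append(float(saliency_map.sum()) if ok else 0.0)
    return np.asarray(scores)


def most_salient_segment(frames, n=32, method="fine_grained"):
    """Return the n-frame window with the highest normalised total saliency."""
    scores = frame_saliency_scores(frames, method)
    if len(frames) <= n:
        return frames
    # Saliency summed over every sliding window of n consecutive frames.
    window_sums = np.convolve(scores, np.ones(n), mode="valid")
    window_sums /= window_sums.max()  # normalise the window scores
    start = int(np.argmax(window_sums))
    return frames[start:start + n]
```

The Uniform Sampling and Center Segment baselines fit the same interface by replacing the saliency-based choice with, respectively, evenly spaced frame indices or a window centred on the middle frame.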
3. Results and Discussion

As a preliminary study, we compare our fine-tuning approach with the from-scratch baseline in order to analyse the effect of progressively unfreezing the weights of the Transformer encoders, starting from the one next to the regression head and moving towards the input. We resort to the Uniform Sampling method for fine-tuning in this step. The results in terms of the Spearman Rank Correlation Coefficient (SRCC), the official metric of the task, are shown in Table 1, where we observe that our fine-tuning proposal significantly outperforms the baseline with just a single unfrozen encoder. This supports our idea that the ViViT model benefits greatly from a pre-training step in which general knowledge is acquired, and that it can transfer these learnt relationships to the memorability problem. On the other hand, the fact that our best result comes from unfreezing all the model weights and letting the model update as a whole leads us to think that the specific visual and semantic language of the task still plays a crucial role in solving it, and that this generic knowledge must therefore be conditioned on it.

Table 1
Results on the Memento10k dev set, where the baseline is compared with different fine-tuning strategies.

  # Unfrozen encoders | Baseline |   1    |   3    |   5    |  All
  SRCC                |  0.4119  | 0.5573 | 0.5663 | 0.6274 | 0.6529

This synergy between broad and specific expertise encourages us to use the fine-tuning approach for our runs, and to explore whether automatic segment selection can enhance the adaptation process. With this in mind, we report the final test set results for our runs in Table 2, where we compare the different segment selection methods.

Table 2
SRCC results for the different segment selection strategies.

  Segment Selection Strategy          | Memento10k dev set | VideoMem test set
  Uniform Sampling                    |       0.653        |       0.437
  Center Segment                      |       0.651        |       0.440
  Salient Segment - Fine Grained      |       0.657        |       0.441
  Salient Segment - Spectral Residual |       0.640        |       0.433

We observe no significant difference between the saliency-based methods and the naive approaches used for comparison, either on the Memento10k development set or on the VideoMem test set, apart from a slight drop in performance when using the Spectral Residual method, indicating that the relationship between the spectral characteristics of an image and its memorability is somewhat weaker than that captured by the more nuanced Fine Grained method. As can be seen in Figure 1, the Fine Grained saliency maps are more detailed, in contrast with the less defined appearance of the Spectral Residual maps, which may influence which segment is selected. Overall, however, the fine-tuned model seems to benefit equally from segments taken across the whole video, independently of which part of it is used as input. Although we believe this is a sign of the robustness of our proposal, a more in-depth analysis of the relationship between image saliency and annotators' responses in terms of memorability could further enhance the capabilities of this type of architecture.
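As a reference for the metric itself, SRCC values such as those above can be computed with scipy; a minimal sketch, using illustrative arrays rather than the actual task data:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical predicted and ground-truth memorability scores for a dev set.
predicted = np.array([0.81, 0.64, 0.92, 0.55, 0.73])
ground_truth = np.array([0.78, 0.60, 0.95, 0.59, 0.70])

# Spearman Rank Correlation Coefficient, the official metric of the task.
srcc, p_value = spearmanr(predicted, ground_truth)
print(f"SRCC: {srcc:.4f} (p = {p_value:.3g})")
```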
4. Conclusions

In this paper we outline our contribution to the MediaEval 2023 Predicting Video Memorability task. We propose to leverage pre-trained Video Transformers in order to create robust memorability predictors that take sequences of frames as input, and we explore automatic segment selection methods based on saliency. Our results show that fine-tuning significantly outperforms training from scratch in our setup, but that the model is not especially sensitive to the automatic selection methods. We aim to deepen our exploration of this matter by developing advanced methods, based on saliency and other perceptual features, that output multiple candidate segments in order to broaden the training information, as well as by evaluating the potential benefits of these methods on models trained from scratch.

Acknowledgments

We would like to thank M. Gabriel Constantin for his insights on his work, which have been greatly helpful for our research. I.M.-F.'s research was supported by the UPM (Programa Propio I+D+i). This work was funded by Project ASTOUND (101071191 — HORIZON-EIC-2021-PATHFINDERCHALLENGES-01) of the European Commission and by the Spanish Ministry of Science and Innovation through the projects GOMINOLA (PID2020-118112RB-C22) and BeWord (PID2021-126061OB-C43), funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR".

References

[1] M. G. Constantin, C.-H. Demarty, C. Fosco, A. García Seco de Herrera, S. Halder, G. Healy, B. Ionescu, A. Matran-Fernandez, R. Savran Kiziltepe, A. F. Smeaton, L. Sweeney, Overview of the MediaEval 2023 Predicting Video Memorability task, in: Proc. of the MediaEval 2023 Workshop, Amsterdam, The Netherlands and Online, 2024.
[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[3] M. G. Constantin, B. Ionescu, Using vision transformers and memorable moments for the prediction of video memorability, in: MediaEval 2021 Workshop, 2021.
[4] M. Agarla, L. Celona, R. Schettini, et al., Predicting video memorability using a model pretrained with natural language supervision, in: MediaEval Multimedia Benchmark Workshop 2022 Working Notes, 2023.
[5] M. G. Constantin, B. Ionescu, AIMultimediaLab at MediaEval 2022: Predicting media memorability using video vision transformers and augmented memorable moments, in: Working Notes Proceedings of the MediaEval 2022 Workshop, 2023.
[6] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, ViViT: A video vision transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846.
[7] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., The Kinetics human action video dataset, arXiv preprint arXiv:1705.06950 (2017).
[8] P. Isola, D. Parikh, A. Torralba, A. Oliva, Understanding the intrinsic memorability of images, Advances in Neural Information Processing Systems 24 (2011).
[9] R. Dubey, J. Peterson, A. Khosla, M.-H. Yang, B. Ghanem, What makes an object memorable?, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[10] M. Mancas, O. Le Meur, Memorability of natural scenes: The role of attention, in: 2013 IEEE International Conference on Image Processing, 2013, pp. 196–200. doi:10.1109/ICIP.2013.6738041.
[11] V. Mudgal, Q. Wang, L. Sweeney, A. F. Smeaton, Using saliency and cropping to improve video memorability, arXiv preprint arXiv:2309.11881 (2023).
[12] S. Montabone, A. Soto, Human detection using a mobile platform and novel features derived from a visual saliency mechanism, Image and Vision Computing 28 (2010) 391–402.
[13] G. Bradski, The OpenCV Library, Dr. Dobb's Journal of Software Tools (2000).
[14] X. Hou, L. Zhang, Saliency detection: A spectral residual approach, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. doi:10.1109/CVPR.2007.383267.