Using Vision Transformers and Memorable Moments for the Prediction of Video Memorability

Mihai Gabriel Constantin, Bogdan Ionescu
University Politehnica of Bucharest, Romania
mihai.constantin84@upb.ro

ABSTRACT
This paper describes the approach taken by the AI Multimedia Lab team for the MediaEval 2021 Predicting Media Memorability task. Our approach is based on a Vision Transformer-based learning method, which is optimized by filtering the training sets of the two proposed datasets. We attempt to train the methods we propose with video segments that are more representative of the videos they are part of. We test several types of filtering architectures, and submit and test the architectures that performed best in our preliminary studies.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

1 INTRODUCTION
Media memorability has attracted the attention of researchers from different domains for a long time, including studies revealing that humans have an uncanny ability to memorize large quantities of images, going so far as to correctly encode details from those images. Generally speaking, there is a certain discrepancy between the study of image and video memorability, with more attention given in the current literature to the former. In this context, the MediaEval Predicting Media Memorability task [6], now at its fourth edition, creates a common benchmarking task for predicting the short- and long-term memorability of videos. The task offers data extracted and annotated from two datasets, TRECVid [1] and Memento10k [8], and proposes two open subtasks, related to direct memorability prediction and to generalization between the two datasets.

Our proposed method for video memorability prediction relies on the use of Vision Transformer networks for feature extraction, a dense network head for sample regression, and a frame filtering method that attempts to use the most memorable moments from the video samples during training. The rest of the paper is organized as follows: Section 2 presents the works most related to our proposed approach, while our method is presented in Section 3. Section 4 presents the results both from our training and development process and on the final testing set. Finally, the main conclusions are presented in Section 5.

2 RELATED WORK
Deep neural networks have come a long way in addressing many machine learning problems, and for a long time, starting with the success of AlexNet [7], convolutional neural networks were the norm for obtaining the best results in visual data processing. Interestingly, some domains related to the human perception of media data did not adhere to this general trend, as concepts like fusion, data manipulation, and traditional feature extractors were sometimes more important for obtaining good results than deep neural networks [3], indicating a need for deeply understanding the data and the way it influences human subjects. Recently, Vision Transformers have shown their usefulness for image processing, surpassing convolutional approaches in image recognition tasks [5]. To the best of our knowledge, this approach is relatively untested in the domain of media memorability. This is perhaps to be expected, as the rise of Vision Transformers is in itself a novelty at this point in time.

3 APPROACH
The general outline of our memorability prediction method is presented in Figure 1. We propose a three-stage system. In the first stage, we theorize that not all frames might be valuable for memorability estimation and therefore propose a frame filtering method. Following this, we extract visual features using a Vision Transformer architecture, and, in a final step, we perform regression with a dense MLP architecture.

Figure 1: Diagram of the proposed frame filtering solution. The Frame Selection phase picks the most representative images from the entire set of frames in a video; these are then processed by a Vision Transformer architecture and by a dense MLP head in order to obtain the final memorability score. Annotator picks on the training set for the Memorable Moments are marked with a green tick.

Frame filtering. We base our frame filtering system on the assumption that not all frames are equal when trying to determine the properties of a larger video sequence. In our case, we propose using the annotations provided by the organizers to select the frames that may best characterize the video from a memorability standpoint. We call these frames "Memorable Moments", and while they may not represent the exact moment or the exact process of human memory retrieval, we theorize that they may represent a better approach than simply attempting to use the entire video for processing.

We test several setups for the frame filtering method as follows. First of all, we have to take into account the lag between human memory recognition and button press. Therefore, given rt, a user's response time in milliseconds from the start of the video, we subtract 500, 1000, or 1500 milliseconds from the rt value in order to estimate the actual time of retrieval from memory. Of course, we cap the resulting value at zero in case retrieval occurred very close to the start of the video. Furthermore, we take a variable number of frames, namely 15, 30, or 60, starting from the resulting location, and use them for analysis. We compare our filtering method (which we call R2) against a default method where the entire video is taken into consideration (called R1).
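To make the frame filtering step concrete, the sketch below illustrates how the R2 selection could be implemented under the assumptions described above. The function name, the fixed frame rate, and the way the annotation is passed in are our own illustrative choices, not the official task data format; this is a minimal sketch rather than the exact code used in our submission.

# Illustrative sketch of the "Memorable Moments" selection used for R2.
# Assumes the annotation exposes a response time rt in milliseconds and
# that frames can be indexed at a known frame rate; both are simplifying
# assumptions made for this example only.

def select_memorable_frames(rt_ms, num_video_frames, fps=30.0,
                            lag_ms=1000, num_frames=30):
    """Return indices of the frames kept for the filtered (R2) training setup.

    rt_ms      : annotator response time, in ms from the start of the video.
    lag_ms     : assumed lag between memory retrieval and button press
                 (we test 500, 1000 and 1500 ms).
    num_frames : number of frames kept from the estimated retrieval point
                 (we test 15, 30 and 60).
    """
    # Compensate for the reaction lag, capping at zero when retrieval
    # happened very close to the start of the video.
    retrieval_ms = max(rt_ms - lag_ms, 0)

    # Convert the estimated retrieval time to a frame index.
    start_frame = min(int(retrieval_ms / 1000.0 * fps),
                      max(num_video_frames - 1, 0))

    # Keep num_frames consecutive frames, without running past the video end.
    end_frame = min(start_frame + num_frames, num_video_frames)
    return list(range(start_frame, end_frame))

# Example: a 7-second video at 30 fps with a 4.2 s response time.
print(select_memorable_frames(rt_ms=4200, num_video_frames=210))

The R1 baseline simply skips this step and uses frames drawn from the whole video.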
Visual Features. For visual feature extraction, we test two popular Vision Transformer architectures, namely DeiT [9] and BEiT [2]. No special fusion is employed between these two feature types, as at this stage we test them separately and choose the best performing one for the final set of experiments.

Dense MLP. The final stage consists of regressing the features extracted from the selected frames into the final memorability score. This is done via a simple dense architecture with three hidden layers of sizes 1024, 512, and 256.
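To illustrate how the last two stages could fit together, the following minimal sketch pairs a pretrained Vision Transformer backbone from the timm library with the dense regression head described above. The specific timm model names, the 768-dimensional pooled feature size, the sigmoid output range, and the per-frame scoring (which would still need to be aggregated into a video-level prediction, e.g. by averaging) are our assumptions for illustration, not the exact configuration of the submitted systems.

# Minimal sketch: ViT feature extraction followed by the dense MLP head.
# Backbone names and feature size are illustrative assumptions; the paper
# only states that DeiT and BEiT backbones were compared.
import torch
import torch.nn as nn
import timm  # provides pretrained DeiT/BEiT image models


class MemorabilityRegressor(nn.Module):
    def __init__(self, backbone_name="deit_base_patch16_224",
                 feat_dim=768, pretrained=True):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model(backbone_name, pretrained=pretrained,
                                          num_classes=0)
        # Three hidden layers of 1024, 512 and 256 units, as described above;
        # the ReLU activations and the sigmoid output are our choices.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, frames):
        # frames: (batch, 3, 224, 224) tensor of selected video frames.
        features = self.backbone(frames)        # (batch, feat_dim)
        return self.head(features).squeeze(-1)  # one score per frame


# Toy forward pass on two random frames (pretrained=False avoids a download).
model = MemorabilityRegressor(pretrained=False)
with torch.no_grad():
    print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2])

Swapping backbone_name for a BEiT variant (e.g. "beit_base_patch16_224" in timm, assuming that identifier is available) would correspond to the second feature extractor we compare.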
4 RESULTS AND ANALYSIS
In the first stage of development, we test the setups proposed in the previous section by training on the Memento10k training set and testing on its development set. With regard to the frame filtering method, we find that a delay of 1000 milliseconds in response time and 30 analyzed frames form the best setup, though not by a significant margin. For the Transformer architecture, we select DeiT as the best performer in these preliminary tests, though again not by a large margin.

The final results computed on the test set are presented in Table 1. It is interesting to notice that, in five out of the six (R1, R2) comparison pairs, the results were better for the variant of the system that employed filtered training via Memorable Moments, at times by a significant margin.

                                         R1                        R2
                              Spearman  Pearson  MSE    Spearman  Pearson  MSE
Subtask 1
  TRECVid     short-raw          0.293    0.312  0.01      0.297    0.311  0.01
              short-normalized   0.260    0.267  0.01      0.251    0.224  0.01
              long               0.079    0.102  0.01      0.097    0.114  0.04
  Memento10K  short-raw          0.407    0.409  0.01      0.648    0.652  0.01
              short-normalized   0.641    0.641  0.01      0.648    0.650  0.01
Subtask 2
  TRECVid     short-raw          0.089    0.110  0.02      0.091    0.108  0.01

Table 1: Results of the submitted systems on the two subtasks. R1 results are for the non-filtered systems, while R2 results correspond to the filtered versions of the systems.

For the prediction subtask (subtask 1), we find that the results for Memento10K are much better than the ones for TRECVid. This may be a result of many factors, but one of them may be the lower number of video samples in the latter dataset. Also, continuing the trend recorded at the previous edition of the Predicting Media Memorability task [4], we observe lower performance for long-term memorability prediction compared to short-term prediction.

Finally, regarding the generalization subtask (subtask 2), we find a significant drop in performance when compared to subtask 1. This may be due to differences in the types of videos in the two datasets, but methods that reduce this issue must definitely be studied.

5 CONCLUSIONS
In this paper we present a media memorability prediction method that is based on the use of Vision Transformer architectures and a frame filtering method we call Memorable Moments. Our experiments show good results for both components and, for future developments, we propose improving this framework by testing more feature extraction architectures, performing tests against convolutional architectures, predicting Memorable Moments on the test set, and testing this type of approach on other subjective multimedia concepts and properties.

ACKNOWLEDGMENTS
This work was funded under project AI4Media "A European Excellence Centre for Media, Society and Democracy", grant 951911, H2020 ICT-48-2020.

REFERENCES
[1] George Awad, Asad A Butt, Keith Curtis, Yooyoung Lee, Jonathan Fiscus, Afzal Godil, Andrew Delgado, Jesse Zhang, Eliot Godard, Lukas Diduch, and others. 2020. TRECVID 2019: An evaluation campaign to benchmark video activity detection, video captioning and matching, and video search & retrieval. arXiv preprint arXiv:2009.09984 (2020).
[2] Hangbo Bao, Li Dong, and Furu Wei. 2021. BEiT: BERT Pre-Training of Image Transformers. arXiv preprint arXiv:2106.08254 (2021).
[3] Mihai Gabriel Constantin, Liviu-Daniel Ştefan, Bogdan Ionescu, Ngoc QK Duong, Claire-Hélène Demarty, and Mats Sjöberg. 2021. Visual Interestingness Prediction: A Benchmark Framework and Literature Review. International Journal of Computer Vision (2021), 1–25.
[4] Alba García Seco De Herrera, Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor, Bogdan Ionescu, and Alan F Smeaton. 2020. Overview of MediaEval 2020 Predicting Media Memorability Task: What Makes a Video Memorable? In Proceedings of MediaEval'20.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and others. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[6] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F Smeaton, and Lorin Sweeney. 2021. Overview of The MediaEval 2021 Predicting Media Memorability Task. In Proceedings of MediaEval'21.
[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
[8] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva. 2020. Multimodal memorability: Modeling effects of semantics and decay on video memorability. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI. Springer, 223–240.
[9] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347–10357.