Using Vision Transformers and Memorable Moments for the Prediction of Video Memorability

Mihai Gabriel Constantin1, Bogdan Ionescu1
1 University Politehnica of Bucharest, Romania
mihai.constantin84@upb.ro

MediaEval’21, December 13-15 2021, Online
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
This paper describes the approach taken by the AI Multimedia Lab team for the MediaEval 2021 Predicting Media Memorability task. Our approach is based on a Vision Transformer learning method, optimized by filtering the training sets of the two proposed datasets. We attempt to train the proposed methods with video segments that are more representative of the videos they are part of. We test several types of filtering configurations and submit the ones that performed best in our preliminary studies.

1    INTRODUCTION
Media memorability has attracted the attention of researchers from different domains for a long time. This includes studies revealing that humans have an uncanny ability to memorize large quantities of images, going so far as to correctly encode details from those images. Generally speaking, there is a certain discrepancy between the study of image and video memorability, with more attention given in the current literature to the former. In this context, the MediaEval Predicting Media Memorability task [6], now at its fourth edition, creates a common benchmarking task for predicting the short- and long-term memorability of videos. The task offers data extracted and annotated from two datasets, TRECVid [1] and Memento10k [8], and proposes two open subtasks, related to direct memorability prediction and to the generalization of predictions between the two datasets.
   Our proposed method for video memorability prediction relies on the use of Vision Transformer networks for feature extraction, a dense network head for sample regression, and a frame filtering method that attempts to use the most memorable moments of the video samples during training. The rest of the paper is organized as follows: Section 2 presents the works most related to our proposed approach, while our method is presented in Section 3. Section 4 presents the results both during our training and development process and on the final test set. Finally, the main conclusions are presented in Section 5.
                                                                                     from memory. Of course we cap the resulting value at zero in case
2    RELATED WORK
Deep Neural Networks have come a long way in addressing many machine learning problems, and for a long time, starting with the success of AlexNet [7], convolutional neural networks were the norm for obtaining the best results in visual data processing. Interestingly, some domains related to the human perception of media data did not adhere to this general trend, as concepts like fusion, data manipulation and traditional feature extractors were sometimes more important for getting good results than deep neural networks [3], indicating a need to deeply understand the data and the way it influences human subjects.
   Recently, Vision Transformers have shown their usefulness for image processing, surpassing convolutional approaches in image recognition tasks [5]. To the best of our knowledge, this approach is relatively untested in the domain of media memorability. This is perhaps to be expected, as the rise of Vision Transformers is in itself a novelty at this point in time.

3    APPROACH
The general outline of our memorability prediction method is presented in Figure 1. We propose a three-stage system. In the first stage, we theorize that not all frames are equally valuable for memorability estimation and therefore propose a frame filtering method. Following this, we extract visual features using a Vision Transformer architecture, and, in a final step, we perform regression with a dense MLP architecture.
   Frame filtering. We base our frame filtering system on the assumption that not all frames are equal when trying to determine the properties of a larger video sequence. In our case, we propose using the annotations provided by the organizers to select the frames that may best characterize the video from a memorability standpoint. We call these frames "Memorable Moments", and while they may not represent the exact moment or the exact process of human memory retrieval, we theorize that they may represent a better approach than simply attempting to use the entire video for processing.
   We test several setups for the frame filtering method as follows. First of all, we have to take into account the lag time between human memory recognition and button press. Therefore, given rt, a user's response time in milliseconds from the start of the video, we subtract one of the following values: 500, 1000 or 1500 milliseconds from the rt value in order to estimate the actual time of retrieval from memory. We cap the resulting value at zero in case retrieval occurred very close to the start of the video. Furthermore, we take a variable number of frames, namely 15, 30 or 60, starting from the resulting location and use them for analysis. We will compare our filtering method (which we call R2) against a default method where the entire video is taken into consideration (called R1).
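To make the filtering step concrete, the sketch below illustrates how a Memorable Moment window could be selected from a response-time annotation. The function name, the fixed frame rate and the default parameter choices (1000 ms delay, 30 frames) are our own illustrative assumptions; the paper only specifies the subtraction, the clamping at zero and the candidate values for the delay and window size.

```python
# Illustrative sketch of the Memorable Moments frame selection (R2).
# Assumptions (not from the paper): a fixed frame rate and the helper name.
def select_memorable_frames(rt_ms, fps, total_frames,
                            delay_ms=1000, num_frames=30):
    """Return indices of the frames around the estimated retrieval moment.

    rt_ms        -- annotated response time, in ms from the start of the video
    fps          -- frame rate of the video
    total_frames -- number of frames in the video
    delay_ms     -- assumed lag between memory retrieval and button press
                    (the paper tests 500, 1000 and 1500 ms)
    num_frames   -- size of the selected window (the paper tests 15, 30 and 60)
    """
    # Subtract the reaction lag and cap at zero, as described in Section 3.
    retrieval_ms = max(rt_ms - delay_ms, 0)
    start = int(retrieval_ms / 1000.0 * fps)
    # Keep the window inside the video.
    start = min(start, max(total_frames - num_frames, 0))
    return list(range(start, min(start + num_frames, total_frames)))


# Example: a response 2.3 s into a 7 s, 30 fps clip.
frames = select_memorable_frames(rt_ms=2300, fps=30, total_frames=210)
```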
   Visual Features. As visual feature extractors, we test two popular Vision Transformer architectures, namely DeiT [9] and BEiT [2]. No special fusion is employed for these two feature types, as at this stage we test them separately and choose the best performing one for the final set of experiments.
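As an illustration of the feature extraction stage, the following sketch loads a pretrained DeiT backbone through the timm library and turns the selected frames into per-frame embeddings. The concrete model variant, the input resolution and the choice of timm itself are assumptions on our side; the paper only states that DeiT and BEiT features are tested.

```python
# Minimal feature extraction sketch (assumed setup: timm + PyTorch).
# The model name and preprocessing are illustrative, not the paper's
# exact configuration.
import timm
import torch
from timm.data import resolve_data_config, create_transform

# num_classes=0 makes timm return pooled embeddings instead of logits.
model = timm.create_model("deit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()

config = resolve_data_config({}, model=model)
preprocess = create_transform(**config)

def extract_frame_features(frames):
    """frames: list of PIL images (the selected Memorable Moment frames)."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return model(batch)  # shape: (num_frames, embed_dim), 768 for DeiT-Base
```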

Figure 1: The diagram of the proposed frame filtering solution (Frame Selection, Vision Transformer, Dense MLP, Memorability Score). The Frame Selection phase picks the most representative images from the entire set of frames in a video, which are then processed by a Vision Transformer architecture and a Dense MLP head in order to obtain the final Memorability Score. Annotator picks on the training set for the Memorable Moments are marked with a green tick mark.

Subtask     Dataset      Target              R1 Spearman   R1 Pearson   R1 MSE    R2 Spearman   R2 Pearson   R2 MSE
Subtask 1   TRECVid      short-raw           0.293         0.312        0.01      0.297         0.311        0.01
Subtask 1   TRECVid      short-normalized    0.260         0.267        0.01      0.251         0.224        0.01
Subtask 1   TRECVid      long                0.079         0.102        0.01      0.097         0.114        0.04
Subtask 1   Memento10k   short-raw           0.407         0.409        0.01      0.648         0.652        0.01
Subtask 1   Memento10k   short-normalized    0.641         0.641        0.01      0.648         0.650        0.01
Subtask 2   TRECVid      short-raw           0.089         0.110        0.02      0.091         0.108        0.01

Table 1: Results of the submitted systems on the two subtasks. R1 denotes the non-filtered systems, while R2 denotes the filtered (Memorable Moments) version of the systems.



   Dense MLP. The final stage maps the features extracted from the selected frames to the final memorability score. This is done via a simple dense architecture with 3 hidden layers of size 1024, 512, and 256.
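A possible realization of this regression head, under the assumption of PyTorch, 768-dimensional DeiT-Base inputs, ReLU activations and mean pooling over the selected frames (none of which are specified in the paper beyond the layer sizes), is sketched below.

```python
# Regression head sketch: 3 hidden layers of size 1024, 512 and 256,
# as stated in Section 3. Activation, pooling and input size are assumptions.
import torch
import torch.nn as nn

class MemorabilityHead(nn.Module):
    def __init__(self, in_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),  # single memorability score
        )

    def forward(self, frame_features):
        # frame_features: (num_frames, in_dim); average over the selected frames.
        clip_feature = frame_features.mean(dim=0)
        return self.mlp(clip_feature).squeeze(-1)
```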
4    RESULTS AND ANALYSIS
In the first stage of development, we test the setups proposed in Section 3, training on the Memento10k training set and testing on its development set. With regard to the frame filtering method, we find that a delay of 1000 milliseconds in the response time and a window of 30 analyzed frames represent the best setup, though not by a significant margin. For the Transformer architecture, we select DeiT as the best performer in these preliminary tests, though again not by a large margin.
   The final results computed on the test set are presented in Table 1. It is interesting to notice that, in five out of the six (R1, R2) comparison pairs, the results were better for the variant of the system that employed filtered training via Memorable Moments, at times by a significant margin.
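The scores in Table 1 are Spearman and Pearson correlations and the mean squared error between predicted and ground-truth memorability. A minimal sketch of how such scores can be computed, assuming SciPy and NumPy (the official task evaluation scripts may differ), is shown below.

```python
# Sketch of the reported metrics, assuming SciPy/NumPy; the official
# MediaEval evaluation code may compute them differently.
import numpy as np
from scipy.stats import spearmanr, pearsonr

def evaluate(predicted, ground_truth):
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    rho, _ = spearmanr(predicted, ground_truth)   # rank correlation
    r, _ = pearsonr(predicted, ground_truth)      # linear correlation
    mse = float(np.mean((predicted - ground_truth) ** 2))
    return {"spearman": rho, "pearson": r, "mse": mse}
```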
   For the prediction subtask (Subtask 1) we find that the results for Memento10k are much better than the ones for TRECVid. This may be the result of many factors, but one of them may be the lower number of video samples in the latter dataset. Also, continuing the trend recorded at the previous edition of the Predicting Media Memorability task [4], we observe lower performance for long-term memorability prediction compared to short-term prediction.
   Finally, regarding the generalization subtask (Subtask 2), we find a significant drop in performance when compared to Subtask 1. This may be due to differences in the types of videos in the two datasets, but methods that reduce this issue must definitely be studied.

5    CONCLUSIONS
In this paper we present a media memorability prediction method that is based on the use of Vision Transformer architectures and a frame filtering method we call Memorable Moments. Our experiments show good results for both of these components and, for future developments, we propose improving this framework by testing more feature extraction architectures, performing tests against convolutional architectures, predicting Memorable Moments on the test set, and testing this type of approach on other subjective multimedia concepts and properties.

ACKNOWLEDGMENTS
This work was funded under project AI4Media “A European Excellence Centre for Media, Society and Democracy”, grant 951911, H2020 ICT-48-2020.


REFERENCES
 [1] George Awad, Asad A Butt, Keith Curtis, Yooyoung Lee, Jonathan
     Fiscus, Afzal Godil, Andrew Delgado, Jesse Zhang, Eliot Godard, Lukas
     Diduch, and others. 2020. Trecvid 2019: An evaluation campaign to
     benchmark video activity detection, video captioning and matching,
     and video search & retrieval. arXiv preprint arXiv:2009.09984 (2020).
 [2] Hangbo Bao, Li Dong, and Furu Wei. 2021. BEiT: BERT Pre-Training
     of Image Transformers. arXiv preprint arXiv:2106.08254 (2021).
 [3] Mihai Gabriel Constantin, Liviu-Daniel Ştefan, Bogdan Ionescu,
     Ngoc QK Duong, Claire-Hélène Demarty, and Mats Sjöberg. 2021.
     Visual Interestingness Prediction: A Benchmark Framework and Lit-
     erature Review. International Journal of Computer Vision (2021), 1–25.
 [4] Alba García Seco De Herrera, Rukiye Savran Kiziltepe, Jon Chamber-
     lain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor,
     Bogdan Ionescu, and Alan F Smeaton. 2020. Overview of MediaEval
     2020 Predicting Media Memorability Task: What Makes a Video Mem-
     orable? Proceedings of MediaEval’20 (2020).
 [5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis-
     senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani,
     Matthias Minderer, Georg Heigold, Sylvain Gelly, and others. 2020.
     An image is worth 16x16 words: Transformers for image recognition
     at scale. arXiv preprint arXiv:2010.11929 (2020).
 [6] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène
     Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera,
     Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F.
     Smeaton, and Lorin Sweeney. 2021. Overview of The MediaEval 2021
     Predicting Media Memorability Task. In Proceedings of MediaEval’21.
 [7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Im-
     agenet classification with deep convolutional neural networks. Ad-
     vances in neural information processing systems 25 (2012), 1097–1105.
 [8] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry Mc-
     Namara, and Aude Oliva. 2020. Multimodal memorability: Modeling
     effects of semantics and decay on video memorability. In Computer
     Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
     23–28, 2020, Proceedings, Part XVI 16. Springer, 223–240.
 [9] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa,
     Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient
     image transformers & distillation through attention. In International
     Conference on Machine Learning. PMLR, 10347–10357.