<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Video Transformers and Automatic Segment Selection for Memorability Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iván Martín-Fernández</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Esteban-Romero</string-name>
          <email>sergio.estebanro@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaime Bellver-Soler</string-name>
          <email>jaime.bellver@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Gil-Martín</string-name>
          <email>manuel.gilmartin@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Fernández-Martínez</string-name>
          <email>fernando.fernandezm@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>UPM</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper summarises THAU-UPM's approach and results for the MediaEval 2023 Predicting Video Memorability task. Focused on the generalisation subtask, our work leverages a pre-trained Video Vision Transformer (ViViT), fine-tuned on memorability-related data, to model temporal and spatial relationships in videos. We propose novel, annotator-independent automatic segment selection methods grounded in visual saliency. These methods identify the most relevant video frames prior to memorability score estimation, and the selection process is applied during both the training and evaluation phases. Our study demonstrates the effectiveness of fine-tuning the ViViT model compared to a scratch-trained baseline, emphasising the importance of pre-training for predicting memorability. However, the model shows comparable sensitivity to both saliency-based and naive segment selection methods, suggesting that fine-tuning may harness similar benefits from various video segments. These results underscore the robustness of our approach but also signal the need for ongoing research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivating Work</title>
<p>Memorability is an aspect of human perception that has attracted the interest of researchers in psychology, neuroscience and computer science alike, owing to its relevance to areas as diverse as disease diagnosis, marketing and education. Leveraging the burgeoning advances in artificial intelligence architectures for media retrieval, classification and analysis as a proxy for modelling the connections between human senses and our understanding of the world through cognitive processes is particularly appealing, which explains the steady stream of work on the subject in recent years.</p>
      <p>
        The MediaEval Predicting Video Memorability task, currently in its sixth edition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], plays an important role in this effort. This contribution addresses the generalisation subtask, which involves training systems that learn general knowledge about the task and can then be tested on different datasets.
      </p>
      <p>
        To the best of our knowledge, most recent approaches to the Predicting Video Memorability task rely on image-level architectures to extract knowledge from a handful of frames and then perform some fusion strategy to obtain a single representation for the entire video, using powerful image-only backbone models such as the Vision Transformer while neglecting architectures that take video itself as input [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. A notable exception comes from Constantin and Ionescu [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], who train a Video Vision Transformer (ViViT) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to predict memorability from video segments, thus integrating the temporal aspect of videos into the core architecture of the system. They also present a technique for selecting which video segments are used to train and evaluate the model, based on the time it took annotators to recall watching a video. Although the authors demonstrate the effectiveness of this method, we aim to develop an alternative that is based purely on input data and can therefore be used in the absence of this time-specific annotation. Furthermore, our strategy can be used for both training and evaluation, which we argue is an advantage over annotation-based approaches, where an arbitrary segment selection method has to be designed for the testing phase in order to avoid data leakage.
      </p>
      <p>In the spirit of transfer learning and generalisation, we propose to fine-tune a Video Vision Transformer, pre-trained on a generic video classification task, on memorability-related data. Furthermore, we evaluate different strategies for selecting the video segments that are fed into the model during both training and evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>We hypothesise that the ViViT architecture has the potential to be a robust, data-agnostic model for memorability prediction, and therefore to perform well in the generalisation task scenario. With this in mind, our approach is based on incorporating generic knowledge into the training process using two complementary strategies: a) fine-tuning a pre-trained ViViT model instead of training from scratch, and b) proposing automatic segment selection methods that do not rely on annotator data.</p>
      <sec id="sec-2-1">
        <title>2.1. Fine-tuning Video Transformers</title>
        <p>
          The ViViT Transformer is an adaptation of the original Vision Transformer that can model the temporal relationships between frames as well as the spatial relationships within each image, by including a three-dimensional Tubelet Embedding encoder before the Transformer input. We start our training from the official ViViT checkpoint available on Hugging Face (<ext-link ext-link-type="uri" xlink:href="https://huggingface.co/google/vivit-b-16x2-kinetics400">https://huggingface.co/google/vivit-b-16x2-kinetics400</ext-link>). Its training data, Kinetics 400 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], consists of 10-second clips extracted from YouTube videos, each depicting one of 400 possible human actions, with a minimum of 400 clips per action class. We believe that modelling the subtleties of human imagery with this vast amount of content is key to understanding media memorability, as there is a direct relationship between the two concepts [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Our regression head consists of a linear layer followed by a sigmoid activation function, appended to the last hidden state of the final encoder. This design operates under the hypothesis that this representation is inherently meaningful, requiring no further transformations.
        </p>
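        <p>As an illustrative sketch of this design, assuming the Hugging Face transformers implementation of ViViT (the class and attribute names below are ours, not the exact code used in our experiments):</p>
        <preformat>
# Sketch: ViViT backbone with a linear + sigmoid regression head appended to
# the last hidden state of the final encoder. Names are illustrative only.
import torch.nn as nn
from transformers import VivitModel

class ViViTMemorability(nn.Module):
    def __init__(self, checkpoint="google/vivit-b-16x2-kinetics400"):
        super().__init__()
        self.backbone = VivitModel.from_pretrained(checkpoint)
        # Map the [CLS] token of the last hidden state to a score in [0, 1].
        self.head = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, 1),
            nn.Sigmoid(),
        )

    def forward(self, pixel_values):
        # pixel_values: (batch, 32 frames, 3 channels, 224, 224)
        hidden = self.backbone(pixel_values=pixel_values).last_hidden_state
        return self.head(hidden[:, 0]).squeeze(-1)
        </preformat>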
        <p>
          We train on a single 32-frame segment extracted from each video in the training set, using one of the segment selection methods described next. This frame count is imposed by the architecture of the model that we wish to fine-tune. In order to compare against our fine-tuning proposal, we train a baseline ViViT model from scratch, using the implementation proposed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (i.e., 15 frames per segment, 8 attention heads per Transformer encoder, and 8 encoders). This baseline is trained on every possible 15-frame segment that can be extracted from each of the videos in the training set, so as to maximise the amount of information used for learning. We aim to test whether this simpler architecture can compensate for the lack of pre-training data with the ability to generate more meaningful representations of memorability-related videos.
        </p>
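        <p>A minimal sketch of this exhaustive enumeration, assuming frames are already decoded (the helper name is ours):</p>
        <preformat>
# Sketch: every possible 15-frame segment of a clip, as used to train the
# from-scratch baseline. `frames` is a list of decoded video frames.
def all_segments(frames, n=15):
    return [frames[i:i + n] for i in range(len(frames) - n + 1)]
        </preformat>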
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Designing an automatic segment selection method</title>
        <p>
          Using [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] as a reference, we elaborate on the idea of selecting the most representative segment of a video and propose a novel method that is annotator-independent and selects the most relevant set of frames using only visual information, instead of relying on label-related data. Based on the existing conception that saliency, defined as the prominence of features within an image that naturally attract human attention, is closely related to memorability [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
          ], we propose a method that automatically selects the most salient segment of a video and uses it as input. We compare two different methods for computing image saliency. The first one, based on [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and denoted Fine Grained following the OpenCV implementation [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], analyses localised variations in the image to identify salient regions. The second method, Spectral Residual [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], identifies areas that stand out in the spectral domain of an image. By comparing these approaches, we aim to determine whether the nuanced detail detection of the Fine Grained method or the global anomaly identification of the Spectral Residual approach is more effective at isolating memorable segments in videos. To identify the most representative video segment, we calculate the total pixel saliency of each frame, sum the saliency within a sliding window of N = 32 frames, and normalise these values. The frame with the highest normalised window saliency and its N - 1 adjacent frames are then selected.
        </p>
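        <p>A sketch of this selection procedure follows, assuming the OpenCV saliency module from opencv-contrib-python (function and variable names are ours):</p>
        <preformat>
# Sketch of the saliency-based segment selection. `frames` is a list of
# decoded BGR frames; N is the segment length imposed by the architecture.
import cv2
import numpy as np

N = 32

def most_salient_segment(frames, method="fine_grained"):
    if method == "fine_grained":
        detector = cv2.saliency.StaticSaliencyFineGrained_create()
    else:
        detector = cv2.saliency.StaticSaliencySpectralResidual_create()
    # Total pixel saliency of each frame.
    totals = []
    for frame in frames:
        ok, saliency_map = detector.computeSaliency(frame)
        totals.append(saliency_map.sum() if ok else 0.0)
    # Sum over a sliding window of N frames, then normalise.
    window_sums = np.convolve(np.asarray(totals), np.ones(N), mode="valid")
    window_sums = window_sums / window_sums.max()
    # The N frames of the highest-saliency window form the segment.
    start = int(window_sums.argmax())
    return frames[start:start + N]
        </preformat>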
        <p>To test our approach, we compare it to two image-agnostic baseline methods: Uniform Sampling of N frames from the entire clip, and extracting the N frames of the Center Segment of the video.</p>
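        <p>Under the same assumptions, the two baselines admit straightforward sketches:</p>
        <preformat>
# Sketch of the two image-agnostic baselines (helper names are ours).
import numpy as np

def uniform_sample(frames, n=32):
    # n frames evenly spaced across the entire clip.
    idx = np.linspace(0, len(frames) - 1, n).astype(int)
    return [frames[i] for i in idx]

def center_segment(frames, n=32):
    # The n contiguous frames around the clip's midpoint.
    start = max(0, (len(frames) - n) // 2)
    return frames[start:start + n]
        </preformat>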
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>As a preliminary study, we compare our fine-tuning approach with the from-scratch baseline in order to analyse the effect of progressively unfreezing the weights of the Transformer encoders, starting from the one next to the regression head and moving towards the input. We resort to the Uniform Sampling method for fine-tuning in this step. The results in terms of Spearman Rank Correlation Coefficient (SRCC), the official metric for the task, are shown in Table 1, where we observe that our fine-tuning proposal significantly outperforms the baseline with just a single unfrozen encoder. This supports our idea that the ViViT model benefits greatly from a pre-training step in which general knowledge is acquired, and that it can translate these learnt relationships to the memorability problem. On the other hand, the fact that our best result comes from unfreezing all the model weights and letting the model update as a whole leads us to think that the specific visual and semantic language of the task still plays a crucial role in solving it, and that the aforementioned generic knowledge must therefore be conditioned to it. This synergy between broad and specific expertise encourages us to use the fine-tuning approach for our runs, and to explore whether an automatic segment selection can enhance the adaptation process.</p>
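      <p>A sketch of the progressive unfreezing schedule, assuming the ViViTMemorability module sketched in Section 2.1 and the layer layout of the transformers ViViT implementation (SRCC itself can be computed with scipy.stats.spearmanr):</p>
      <preformat>
# Sketch: progressively unfreeze encoder blocks, starting from the block
# next to the regression head and moving towards the input.
def unfreeze_last_k_encoders(model, k):
    # Freeze the whole backbone first.
    for p in model.backbone.parameters():
        p.requires_grad = False
    # Unfreeze the k encoder blocks closest to the regression head.
    for block in model.backbone.encoder.layer[-k:]:
        for p in block.parameters():
            p.requires_grad = True
    # The regression head always remains trainable.
    for p in model.head.parameters():
        p.requires_grad = True
      </preformat>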
      <p>With this in mind, we show the final testing set results for our runs in Table 2, where we compare the different segment selection methods. We perceive no significant difference between the saliency-based methods and the naive approaches used for comparison, on either the Memento10k development set or the VideoMem test set, apart from a slight drop in performance when using the Spectral Residual method, indicating that the relationship between the spectral characteristics of an image and its memorability is somewhat weaker than that captured by a more nuanced approach. As can be seen in Figure 1, the Fine-Grained saliency maps are more detailed, in contrast with the less defined aspect of the Spectral Residual ones, which may influence the selected segment. However, it seems that the fine-tuned model benefits equally from segments across the whole video, independently of which part of it is used as input. Although we believe this is a sign of the robustness of our proposal, a more in-depth analysis of the relationship between image saliency and annotators' responses in terms of memorability could further enhance the capabilities of this type of architecture.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper we outline our contribution to the MediaEval 2023 Predicting Video Memorability task. We propose to leverage pre-trained Video Transformers in order to create robust memorability predictors that take sequences of frames as input. We also explore automatic segment selection methods based on saliency. Our results show that, in our setup, fine-tuning significantly outperforms training from scratch, but that the model is not especially sensitive to the automatic selection methods. We aim to deepen our exploration of the matter by developing advanced methods, based on saliency and other perceptual features, that output multiple candidate segments in order to broaden the training information, as well as by evaluating the potential benefits of these methods on models trained from scratch.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank M. Gabriel Constantin for his insights on his work, which have been of great help to our research. I.M.-F.’s research was supported by the UPM (Programa Propio I+D+i). This work was funded by Project ASTOUND (101071191, HORIZON-EIC-2021-PATHFINDERCHALLENGES-01) of the European Commission and by the Spanish Ministry of Science and Innovation through the projects GOMINOLA (PID2020-118112RB-C22) and BeWord (PID2021-126061OB-C43), funded by MCIN/AEI/10.13039/501100011033 and by the European Union “NextGenerationEU/PRTR”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M. G.</given-names> <surname>Constantin</surname></string-name>,
          <string-name><given-names>C.-H.</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Fosco</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>García Seco de Herrera</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Halder</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Healy</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Matran-Fernandez</surname></string-name>,
          <string-name><given-names>R. Savran</given-names> <surname>Kiziltepe</surname></string-name>,
          <string-name><given-names>A. F.</given-names> <surname>Smeaton</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Sweeney</surname></string-name>,
          <article-title>Overview of the MediaEval 2023 Predicting Video Memorability task</article-title>,
          <source>in: Proc. of the MediaEval 2023 Workshop</source>, Amsterdam, The Netherlands and Online,
          <year>2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>A.</given-names> <surname>Dosovitskiy</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Beyer</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kolesnikov</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Weissenborn</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhai</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Unterthiner</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Dehghani</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Minderer</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Heigold</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gelly</surname></string-name>, et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>,
          <source>arXiv preprint arXiv:2010.11929</source>
          (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>M. G.</given-names> <surname>Constantin</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <article-title>Using vision transformers and memorable moments for the prediction of video memorability</article-title>,
          <source>in: MediaEval 2021 Workshop</source>,
          <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>M.</given-names> <surname>Agarla</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Celona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Schettini</surname></string-name>, et al.,
          <article-title>Predicting video memorability using a model pretrained with natural language supervision</article-title>,
          <source>in: MediaEval Multimedia Benchmark Workshop 2022 Working Notes</source>,
          <year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>M. G.</given-names> <surname>Constantin</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <article-title>AIMultimediaLab at MediaEval 2022: Predicting media memorability using video vision transformers and augmented memorable moments</article-title>,
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>A.</given-names> <surname>Arnab</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Dehghani</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Heigold</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lučić</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Schmid</surname></string-name>,
          <article-title>ViViT: A video vision transformer</article-title>,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>,
          <year>2021</year>, pp.
          <fpage>6836</fpage>-<lpage>6846</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Back</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natsev</surname>
          </string-name>
          , et al.,
          <article-title>The kinetics human action video dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1705.06950</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <article-title>Understanding the intrinsic memorability of images</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>24</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>R.</given-names> <surname>Dubey</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Peterson</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Khosla</surname></string-name>,
          <string-name><given-names>M.-H.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ghanem</surname></string-name>,
          <article-title>What makes an object memorable?</article-title>,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision (ICCV)</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>M.</given-names> <surname>Mancas</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Le Meur</surname></string-name>,
          <article-title>Memorability of natural scenes: The role of attention</article-title>,
          <source>in: 2013 IEEE International Conference on Image Processing</source>,
          <year>2013</year>, pp.
          <fpage>196</fpage>-<lpage>200</lpage>. doi:10.1109/ICIP.2013.6738041.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mudgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>Using saliency and cropping to improve video memorability</article-title>
          ,
          <source>arXiv preprint arXiv:2309.11881</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Montabone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soto</surname>
          </string-name>
          ,
          <article-title>Human detection using a mobile platform and novel features derived from a visual saliency mechanism</article-title>
          ,
          <source>Image and Vision Computing</source>
          <volume>28</volume>
          (
          <year>2010</year>
          )
          <fpage>391</fpage>
          -
          <lpage>402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>G.</given-names> <surname>Bradski</surname></string-name>,
          <article-title>The OpenCV Library</article-title>,
          <source>Dr. Dobb's Journal of Software Tools</source>
          (<year>2000</year>).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>X.</given-names> <surname>Hou</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name>,
          <article-title>Saliency detection: A spectral residual approach</article-title>,
          <source>in: 2007 IEEE Conference on Computer Vision and Pattern Recognition</source>,
          <year>2007</year>, pp.
          <fpage>1</fpage>-<lpage>8</lpage>. doi:10.1109/CVPR.2007.383267.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>