<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AIMultimediaLab at MediaEval 2022: Predicting Media Memorability Using Video Vision Transformers and Augmented Memorable Moments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihai Gabriel Constantin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper describes AIMultimediaLab's approach and results for the 2022 MediaEval Predicting Video Memorability task. The proposed approach is a continuation of last year's work, using, updating and further analysing the concept of Memorable Moments. This is done by improving the scheme we use for selecting Memorable Moments and by allowing for the possibility that more than one video segment is representative of the entire video clip from a memorability standpoint. Furthermore, we propose studying a new architecture for processing the selected Memorable Moments, implementing a variant of the popular ViViT architecture, which is better suited to analysing pure video content.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Media Memorability is one of the domains that has lately gained considerable traction in the research
community, owing to the need for novel and better methods of classifying the huge quantities
of data associated with social media and video content sharing platforms. While previous
work focused more on the prediction of image-based content, lately a significant push towards
video-based processing can be noticed in the multimedia research environment. In this context,
the MediaEval 2022 Predicting Video Memorability task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], one of the drivers of this tendency
and now at its fifth edition, proposes three subtasks based on the prediction of short-term video
memorability: a video-based prediction task, a generalization task and an EEG-based task. The
data offered by the organizers of this task is extracted from two popular datasets, namely the
Memento10k dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for the first two subtasks and the VideoMem [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] dataset for the second
subtask, which are enhanced by the addition of EEG data for the third subtask.
      </p>
      <p>This paper represents a continuation of our ongoing work on the study
of Memorable Moments in particular, and of the way video segments can be
interpreted as representative of an entire video in general. For this edition of the MediaEval
Memorability task, we only participate in the first subtask. The rest of the paper is
organized as follows: Section 2 presents our approach, Section 3 presents and analyzes
the results, and the paper concludes with Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Approach</title>
      <p>
        Our proposed method represents a continuation of last year’s work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where we studied
the use of two popular Vision Transformer architectures, namely the DeiT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and the BEiT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
networks for extracting visual features from selected frames, and a frame selection method we
called Memorable Moments. We proposed and proved that having a frame selection method is
a positive addition to the overall performance, as it spares the network from processing
frames that may represent noise or are unimportant to the overall memorability score of
the video itself. The first change we propose in this paper is the
replacement of the two architectures with an architecture dedicated to video processing, namely
the popular transformer-based ViViT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] model. Furthermore, we propose augmenting the
previously developed Memorable Moments. While in the past we only selected one region
per video, corresponding to the region most indicated by the annotators as memorable,
we now allow several regions per video, as we theorize that this will take into account more
representative video segments. This approach is shown in Figure 1, with the neural transformer
architecture shown on the left and the frame selection scheme shown on the right side of the
figure.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Neural Network Architecture</title>
        <p>
          As already stated, we use the ViViT architecture [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for processing the frames. Specifically, we
deploy the spatio-temporal multi-headed self-attention ViViT model. This model uses a tubelet
embedding that takes 3-dimensional tubes, created from the two spatial dimensions and the temporal
dimension, and feeds them to the network, ensuring that the network has direct access to
spatio-temporal information. We configure this network to take 15 frames as input, thereby
fixing the frame window from the beginning of our experiments. This architecture passes
the input through a variable number of repeatable spatio-temporal transformer
blocks, each composed of a self-attention block and an MLP block.
        </p>
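        <p>To make this processing pipeline more concrete, the following is a minimal PyTorch-style sketch of a tubelet embedding and of one repeatable spatio-temporal block, in the spirit of the ViViT model described above; the tubelet size, embedding dimension and module names are illustrative assumptions rather than the exact configuration used in our experiments.</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Projects 3-dimensional tubes (time x height x width) to token embeddings."""
    def __init__(self, embed_dim=256, tubelet=(3, 16, 16), in_channels=3):
        super().__init__()
        # A 3D convolution with stride equal to the tubelet size cuts the clip
        # into non-overlapping spatio-temporal tubes and embeds each one.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                      # video: (batch, 3, 15, H, W)
        tokens = self.proj(video)                  # (batch, dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (batch, num_tokens, dim)

class SpatioTemporalBlock(nn.Module):
    """One repeatable block: Norm -> multi-head self-attention -> Norm -> MLP."""
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual attention
        return x + self.mlp(self.norm2(x))                 # residual MLP
]]></preformat>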
        <p>We vary several parameters of the network in order to search for a robust architecture that
would best fit our experiments. First of all, we test several values for the number of parallel
self-attention heads in each block, using the values 4, 8, 16, 32. Secondly, we vary the number
of repeatable blocks, using the values 4, 8, 16, 32.</p>
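        <p>As a sketch of this search, reusing the SpatioTemporalBlock from the previous listing, the configurations can be enumerated one parameter at a time as shown below; the default values and the build_backbone helper are assumptions made for illustration, and the training and Spearman evaluation steps are omitted.</p>
        <preformat><![CDATA[
import torch.nn as nn

# Assumed defaults and the one-parameter-at-a-time search space.
DEFAULTS = {"num_heads": 8, "num_blocks": 8}
SEARCH = {"num_heads": [4, 8, 16, 32], "num_blocks": [4, 8, 16, 32]}

def build_backbone(num_heads, num_blocks, dim=256):
    # Stack of repeatable spatio-temporal blocks (see the sketch above).
    return nn.Sequential(*[SpatioTemporalBlock(dim, num_heads)
                           for _ in range(num_blocks)])

candidates = []
for param, values in SEARCH.items():
    for value in values:
        config = {**DEFAULTS, param: value}   # vary one parameter, keep the rest at default
        candidates.append((config, build_backbone(**config)))
]]></preformat>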
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Memorable Moments</title>
        <p>Regarding the Memorable Moments scheme, we propose several variations for deploying this
frame filtering method. Theoretically, given a video clip that is composed of a set of <italic>F</italic> frames
denoted [<italic>f</italic><sub>1</sub>, <italic>f</italic><sub>2</sub>, ..., <italic>f</italic><sub><italic>F</italic></sub>], and a set of <italic>A</italic> annotators [<italic>a</italic><sub>1</sub>, <italic>a</italic><sub>2</sub>, ..., <italic>a</italic><sub><italic>A</italic></sub>], each annotator
watching the video will press a button whenever they recognise the target video. Given
a delay in response time of 500 milliseconds, corresponding to approximately 15 frames, which
we determined and used in the previous version of the Memorable Moments scheme, we can
calculate the central frame <italic>c</italic><sub><italic>a</italic></sub> that corresponds to each annotator <italic>a</italic>'s moment of recognition.</p>
        <p>Furthermore, we can allocate a score of 1 to each frame in a video that corresponds to an
annotation, and can even extend that to the window of 15 frames that we chose at the beginning
of the experiments. Therefore, if a central frame <italic>c</italic><sub><italic>a</italic></sub> gets a score of 1 from annotator <italic>a</italic>, the
entire window composed of [<italic>c</italic><sub><italic>a</italic></sub> - 7, ..., <italic>c</italic><sub><italic>a</italic></sub>, ..., <italic>c</italic><sub><italic>a</italic></sub> + 7] gets that score. Finally, denoting by <italic>s</italic><sub><italic>f</italic>,<italic>a</italic></sub> the score for frame <italic>f</italic> from annotator <italic>a</italic>, and summing up all the
annotations for a video, we get the total frame score <italic>S</italic><sub><italic>f</italic></sub> = ∑<sub><italic>a</italic>=1</sub><sup><italic>A</italic></sup> <italic>s</italic><sub><italic>f</italic>,<italic>a</italic></sub>.</p>
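        <p>As an illustration of this scoring step, the following sketch accumulates the per-frame scores <italic>S</italic><sub><italic>f</italic></sub>; it assumes the annotations are available simply as one key-press frame index per annotator, which is a simplification made here for clarity rather than the exact dataset format.</p>
        <preformat><![CDATA[
import numpy as np

def frame_scores(num_frames, press_frames, delay_frames=15, half_window=7):
    """Accumulate the Memorable Moments scores S_f over all annotators.

    press_frames holds one key-press frame index per annotator (an assumed
    input format). The central frame c_a is obtained by subtracting the
    ~500 ms reaction delay (about 15 frames), and the 15-frame window
    [c_a - 7, ..., c_a + 7] receives a score of 1 from that annotator.
    """
    scores = np.zeros(num_frames, dtype=int)
    for press in press_frames:
        center = max(0, press - delay_frames)          # central frame c_a
        lo = max(0, center - half_window)
        hi = min(num_frames, center + half_window + 1)
        scores[lo:hi] += 1                             # add this annotator's window
    return scores
]]></preformat>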
        <p>In the next step we propose three methods for selecting the frames, called Single, Double
and Multi. The Single method consists of selecting the frame with the highest value of <italic>S</italic><sub><italic>f</italic></sub> and using
a 15-frame window around it as the single representative segment of the video. The Double
method consists of selecting the two highest values of <italic>S</italic><sub><italic>f</italic></sub>, and using them as two central frames
that generate two representative segments of the video. The final method, Multi, uses a
threshold value <italic>τ</italic> ∈ (0, 1). We select the highest value of <italic>S</italic><sub><italic>f</italic></sub>, and all values that are higher than
<italic>τ</italic> × max <italic>S</italic><sub><italic>f</italic></sub>, therefore getting a variable number of representative segments for each video. In cases
of equality between <italic>S</italic><sub><italic>f</italic></sub> values we choose to take the frame with the lowest index <italic>f</italic>. Finally, we
test several values for <italic>τ</italic>, namely 0.95, 0.90, 0.85 and 0.75.</p>
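        <p>A minimal sketch of the three selection strategies, operating on the <italic>S</italic><sub><italic>f</italic></sub> scores computed above, might look as follows; the function name and the return format (central frame indices, each to be expanded into a 15-frame window as in the Single case) are our own illustrative choices.</p>
        <preformat><![CDATA[
def select_centers(scores, mode="single", tau=0.85):
    """Return the central frame indices of the representative segments.

    mode "single" keeps the best-scoring frame, "double" the two best,
    and "multi" every frame whose score exceeds tau * max(S_f).
    Ties are broken by taking the lowest frame index, as described above.
    """
    order = sorted(range(len(scores)), key=lambda f: (-scores[f], f))
    if mode == "single":
        return order[:1]
    if mode == "double":
        return order[:2]
    best = scores[order[0]]
    return [order[0]] + [f for f in order[1:] if scores[f] > tau * best]
]]></preformat>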
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Results on the development set</title>
        <p>We conduct a set of preliminary experiments in the training phase, consisting of finding the
optimal values for the ViViT size and for the <italic>τ</italic> parameter. The results of these experiments
are presented in Table 1. In each experiment only one parameter is varied, while the
others are kept at their default values. Also, when varying the number of heads
and the number of repeats, the Single method for Memorable Moments is applied. Since
these experiments are done on the development data, 7000 videos are used for training (the
training set) and 1500 for validation (the development set).</p>
        <sec id="sec-3-1-1">
          <title>Nr. Heads</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>Spearman</title>
        </sec>
        <sec id="sec-3-1-3">
          <title>Nr. Repeats Spearman 4</title>
          <p>8
16
32</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Final results for the proposed method, under the Single, Double and Multi Memorable Moments config</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Results on the testing set</title>
        <p>Following this, we use the three values determined in the previous experiments, namely 8
self-attention heads, 8 repeating blocks and a <italic>τ</italic> value of 0.85, and submit three systems for
evaluation by the Memorability task organizers. These three systems are represented by the
Single, Double and Multi variations of the Memorable Moments selection scheme. The results are
presented in Table 2, where the best performing method is shown to be the Multi configuration,
with a Spearman value of 0.665. While we can observe a significant increase in performance
when comparing the Single with the Double method, an even better performance is recorded
by the Multi approach, with almost 2% growth over the Double approach.</p>
        <p>We theorize that this type of performance was to be expected, as each
additional Memorable Moments configuration progressively adds more segments as representatives
of the video clips in the collection, therefore creating more training data. We propose that
it may be interesting to research this problem on a different dataset, one that perhaps contains
more actions. Our reason for proposing this is that, in the current Memento10k dataset, the video
clips are 3 seconds long and generally the actions shown in the clips do not change. It is possible
that in longer clips the changes in actions or angles may be more significant, and having more
representatives for each video may improve the results even more.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper we presented our approach for the MediaEval 2022 Predicting Video Memorability task,
consisting of an updated frame selection method called Memorable Moments, which has the
role of selecting one or more representatives from each video clip for processing and training,
and a video vision transformer ViViT architecture. Results show that selecting more than one
representative for each video improves overall performance.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Financial support provided under project AI4Media, a European Excellence Centre for Media, Society and Democracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] L. Sweeney, M. G. Constantin, C.-H. Demarty, C. Fosco, A. G. S. de Herrera, S. Halder, G. Healy, B. Ionescu, A. Matran-Fernandez, A. F. Smeaton, M. Sultana, Overview of the MediaEval 2022 predicting video memorability task, in: Working Notes Proceedings of the MediaEval 2022 Workshop, 2023.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Newman, C. Fosco, V. Casser, A. Lee, B. McNamara, A. Oliva, Multimodal memorability: Modeling effects of semantics and decay on video memorability, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision - ECCV 2020, Springer International Publishing, Cham, 2020, pp. 223-240.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] R. Cohendet, C.-H. Demarty, N. Q. Duong, M. Engilberge, VideoMem: Constructing, analyzing, predicting short-term and long-term video memorability, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2531-2540.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. G. Constantin, B. Ionescu, Using vision transformers and memorable moments for the prediction of video memorability, in: MediaEval Multimedia Benchmark Workshop Working Notes, 2021. URL: http://ceur-ws.org/Vol-3181/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers &amp; distillation through attention, in: International Conference on Machine Learning, PMLR, 2021, pp. 10347-10357.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] H. Bao, L. Dong, F. Wei, BEiT: BERT pre-training of image transformers, arXiv preprint arXiv:2106.08254 (2021).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, ViViT: A video vision transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836-6846.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>