<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>AIMultimediaLab at MediaEval 2023: Studying the Generalization of Media Memorability Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihai Gabriel Constantin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Video memorability is one of the vital aspects of subjective multimedia perception and, as such, is closely and thoroughly studied in the computer vision literature. This paper presents the methods proposed by AIMultimediaLab for the generalization subtask of the 2023 edition of the Predicting Video Memorability task. We explore several methods for augmenting the training process of a video Vision Transformer network, aiming to increase the number of hard-to-predict samples in the training set and thereby improve the robustness of the targeted AI model. Starting from our previous works, we analyze several visual features that define "hard-to-predict" samples and, based on these features, augment the training data of our models to target the specific videos that pose problems for memorability prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
The prediction of video memorability is an essential aspect of the subjective analysis of
multimedia content, with the MediaEval Predicting Video Memorability series of benchmarking
tasks playing an important role in drawing the computer vision community's attention to
the study of this concept. While previous editions of this benchmarking task focused on
memorability prediction for videos extracted, annotated, and processed under similar conditions,
this edition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] focuses on the generalization task. Concretely, the organizers ask participants
to train on data extracted from one memorability dataset, namely the Memento10k [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and
test their trained systems on data extracted from the VideoMem [
        <xref ref-type="bibr" rid="ref3">3</xref>
] dataset. This allows for
an interesting setup where AI models are exposed to different types of videos, annotated by
different people, and extracted from different sources, thus creating a testing scenario that better
simulates real-world conditions.
      </p>
      <p>As we will show throughout the paper, this work represents the continuation of some of our
previous works on memorability, particularly those targeting the use of vision transformers
in the prediction of subjective concepts, and a sample-based analysis of videos we defined as
"hard-to-predict" from a memorability standpoint. We continue this work by applying training
augmentation, particularly for the problematic videos for our vision transformer networks. The
rest of the paper is structured as follows. Section 2 presents previous works our methods are
based on. Following this, the methods employed by our team are presented in Section 3, while
our results are presented in Section 4. Finally, the paper concludes with Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Our proposed method is built upon two of our previous works. The first one, published in the
previous edition of the MediaEval memorability task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], uses vision transformers to predict
video memorability. More precisely, it is represented by a vision transformer model, derived
from the popular ViViT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] neural network. We will also use the "Memorable Moments" video
segment selection method to select the most representative segments from the training videos
and use only those segments during the training phase. The second work is represented by a
feature-based analysis of all the runs submitted during the previous edition of the memorability
task [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. In that work, we showed that some video samples are significantly
harder to classify for all participants, regardless of the systems they used or their pre-processing
methods.
      </p>
      <p>
        Starting from these two works, we select two segments as representatives for each video in the
training set, based on the Memorable Moments approach presented in the first work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We then
employ several methods, similar to our second work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], for detecting the videos in the training
set that may be hard to classify by the proposed ViViT-derived model. While [
        <xref ref-type="bibr" rid="ref6">6</xref>
] presents only an
analysis of submitted runs and puts forward some hypotheses concerning the features
that make a video hard to classify, this paper seeks to test these hypotheses and apply them
directly to media memorability prediction.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        A general diagram of the training method we propose is presented in Figure 1. We propose
using the Memorable Moments selection scheme, as presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in order to select the two
most representative video segments in each video from the training set. Following this, we
analyze several methods of determining which videos are challenging and which are easy to
classify regarding their memorability score, using a set of features and visual descriptors. In the
last step, we keep only the most representative video segment for easy-to-classify videos, and
keep both segments for the hard-to-classify ones. We theorize that the imbalance we thus create in
the dataset may allow the ViViT-based model to better learn the ground truth of samples that it
might otherwise mispredict.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Memorable Moments</title>
        <p>
          We use the Memorable Moments selection scheme, as presented in [
          <xref ref-type="bibr" rid="ref4">4</xref>
]. Concretely, a
video clip is composed of N frames: F = {f_1, f_2, ..., f_N}. Using the annotations provided by the
competition organizers, we assign a score of 1 to the frame corresponding to the moment
of recall, accounting for a 500 ms delay in response, and extend this score
to a window of 15 frames around the moment of recall. We gather all the annotations and
thus obtain a frame-level recall score for each video: S = [s_1, s_2, ..., s_N]. Finally,
taking the top two recall scores, we obtain the two most significant Memorable Moments
for each particular video. We then extract two video segments around these frames, each of
them 15 frames long.
        </p>
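        <p>As an illustration, this selection scheme can be sketched as follows; the exact peak-picking strategy (argmax with suppression of already-chosen windows) and all variable names are our assumptions, not the original implementation.</p>
        <preformat>
import numpy as np

def memorable_moments(recall_times_s, fps, n_frames,
                      delay_s=0.5, window=15, n_segments=2):
    """Sketch: pick the two most voted 15-frame segments of a video.

    recall_times_s: per-annotator moments of recall, in seconds;
    the 500 ms response delay is subtracted before voting.
    """
    scores = np.zeros(n_frames)
    half = window // 2
    for t in recall_times_s:
        center = int(round((t - delay_s) * fps))  # correct for response delay
        lo, hi = max(0, center - half), min(n_frames, center + half + 1)
        scores[lo:hi] += 1.0  # extend the vote to a 15-frame window
    segments = []
    for _ in range(n_segments):
        peak = int(np.argmax(scores))
        start = max(0, min(peak - half, n_frames - window))
        segments.append((start, start + window))
        # suppress the chosen neighbourhood so the next peak is distinct
        scores[max(0, peak - window):peak + window] = -np.inf
    return segments
        </preformat>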
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prediction difficulty assessment</title>
        <p>
          In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] we presented a set of methods for determining which features are the most discriminative
when analyzing the difficulty of media memorability prediction. Perhaps unsurprisingly, we
found that videos with average memorability ground truth scores are more challenging to
predict accurately than videos that are either very memorable or have low memorability. Other
discriminative features are as follows: sharpness computed via the Laplacian operator [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
(sharper videos are harder to classify with regard to memorability), contrast computed in
RGB space [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] (higher contrast videos are harder to classify), and dynamism computed via the
Farnebäck method [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
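        <p>The following sketch shows how such per-frame descriptors could be computed with OpenCV; the exact parameters used in [6] are not specified here, so these values and the contrast proxy are illustrative assumptions.</p>
        <preformat>
import cv2
import numpy as np

def frame_features(prev_gray, frame_bgr):
    """Illustrative per-frame values for sharpness, contrast, dynamism."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Sharpness: variance of the Laplacian response [7]
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    # Contrast: spread of pixel values in RGB space [8] (illustrative proxy)
    contrast = frame_bgr.astype(np.float64).std()
    # Dynamism: mean optical-flow magnitude via the Farneback method [9]
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dynamism = np.linalg.norm(flow, axis=2).mean()
    return sharpness, contrast, dynamism, gray  # gray becomes next prev_gray
        </preformat>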
<p>We use these four methods to determine which videos are more challenging to classify by AI
models. As presented in the previous section, we keep two Memorable Moments video
segments only for the videos deemed "hard-to-predict". We therefore compute the values
of each of the four features (ground truth score, sharpness, contrast, and dynamism) and split
the training set into four quartiles according to the value of each feature, with the top quartile,
Q1, representing the videos that theoretically should be easier to predict according to each
feature, and the bottom quartile, Q4, representing those that would be hard to predict. The entire
training set T will thus be divided, for each feature f, as follows: T = Q1,f ∪ Q2,f ∪ Q3,f ∪ Q4,f.
We will then keep two Memorable Moments segments only for the videos that belong to the
bottom quartile, Q4,f.</p>
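        <p>In code terms, the per-feature quartile split could look like the sketch below; the per-feature difficulty ordering (e.g., sharpness itself, or the distance of the ground truth score from the dataset mean) and all names are our assumptions.</p>
        <preformat>
import numpy as np

def hard_to_predict_ids(video_ids, difficulty):
    """Sketch: split videos into quartiles by a per-feature difficulty
    score and return the bottom (hardest) quartile Q4."""
    order = np.argsort(np.asarray(difficulty))  # ascending: Q1 is easiest
    q1, q2, q3, q4 = np.array_split(order, 4)   # T = Q1 ∪ Q2 ∪ Q3 ∪ Q4
    return [video_ids[i] for i in q4]

# Videos returned here keep both Memorable Moments segments at
# training time; all other videos keep only the top-ranked segment.
        </preformat>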
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Vision transformer network</title>
        <p>
          We apply these training augmentation schemes to a vision transformer deep neural network
that is based upon the ViViT architecture [
          <xref ref-type="bibr" rid="ref5">5</xref>
]. Specifically, we use tubelet embedding, which
encodes spatio-temporal information as 3-dimensional tubes and feeds them to the network
for training and inference. This architecture handles the 3-dimensional input by passing it
through a series of repeatable spatio-temporal attention blocks. Based on the conclusions of
our previous work [
          <xref ref-type="bibr" rid="ref4">4</xref>
], we design the network to handle 15 frames at input, with
8 parallel self-attention heads in each block and 8 repeatable transformer blocks.
        </p>
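          <p>As an illustration, the resulting configuration can be summarized as below; only the frame count, head count, and depth are fixed by the text, so the tubelet size, embedding width, and all names are our assumptions.</p>
          <preformat>
# Hypothetical hyperparameter summary of the ViViT-derived model;
# tubelet_size and embed_dim are assumed, the rest is stated above.
vivit_config = dict(
    num_frames=15,             # one Memorable Moments segment per sample
    tubelet_size=(2, 16, 16),  # assumed 3-D tubes for tubelet embedding
    embed_dim=768,             # assumed transformer width
    depth=8,                   # 8 repeatable spatio-temporal attention blocks
    num_heads=8,               # 8 parallel self-attention heads per block
    num_outputs=1,             # regress a single memorability score
)
          </preformat>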
<p>This network is then trained in five different setups. In the original setup, only one segment
per video is fed into the network at training time. We consider this setup the baseline for
our approach. The remaining four setups contain augmented samples, represented by one
additional video segment for each video in the Q4 quartile of each of the selected discriminative
features.</p>
        <p>The five resulting training setups, listed in Table 1, are: ViViT - baseline, ViViT + GT score, ViViT + sharpness, ViViT + contrast, and ViViT + dynamism.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
<p>We present the results in Table 1. Firstly, we analyze the results on the validation set of this
task, which is composed of the Memento10k devset. While the un-augmented baseline system
already performs well, with a Spearman's Rank Correlation Coefficient (SRCC) of 0.651, two
of the proposed augmentation methods outscore it in the validation experiments, even if by
a small margin: the ground-truth score-based quartile augmentation (ViViT + GT score),
with an SRCC value of 0.668, and the dynamism-based augmentation (ViViT + dynamism),
with an SRCC value of 0.680. On the other hand, the two
other methods score lower on the devset when compared with the baseline method.</p>
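      <p>For reference, the reported metric can be computed from a model's predictions with scipy; a minimal sketch (function and variable names are ours):</p>
      <preformat>
from scipy.stats import spearmanr

def srcc(predicted, ground_truth):
    """Spearman's Rank Correlation Coefficient between predicted and
    annotated memorability scores (higher is better)."""
    rho, _p_value = spearmanr(predicted, ground_truth)
    return rho
      </preformat>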
      <p>Similar trends are noticeable when looking at the official testset results. The ViViT + GT
score method has the highest performance, with an SRCC value of 0.382, closely followed by the
ViViT + dynamism approach with 0.380. The baseline method scores 0.361, while the sharpness
and contrast methods have even lower scores.</p>
<p>When comparing the direct prediction performance on the Memento10k devset with the
official generalization scores on the VideoMem dataset, we notice a sharp decline in performance.
This indicates the significant difficulty that generalization tasks pose. On the other
hand, we are pleased to report that at least two of the proposed feature-based augmentation
methods scored better than the baseline method. The GT score-based method achieved a 5.81%
relative increase over the baseline run (0.382 vs. 0.361), with similar performance from the dynamism-based method.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
<p>We presented a sample-based augmentation method for media memorability prediction, in
a generalization setup, where the training and the testing data came from different datasets,
which involved different video sources and annotators. Our best performing method augmented
the samples at training time based on their ground truth scores. Concretely, videos that have
ground truth memorability values close to the average were augmented, thus resulting in a
scheme that increases the number of hard-to-predict videos, allowing the AI model to learn
more details about these videos.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Financial support was provided under the AI4Media project, a European Excellence Centre for Media,
Society and Democracy, H2020 ICT-48-2020, grant #951911.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Demarty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Fosco</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            , S. Halder,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Healy</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Matran-Fernandez</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          <string-name>
            <surname>Smeaton</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <article-title>Overview of the MediaEval 2023 predicting video memorability task</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Casser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          , Multimodal memorability:
          <article-title>Modeling efects of semantics and decay on video memorability</article-title>
          , in: Computer Vision-ECCV
          <year>2020</year>
          : 16th European Conference, Glasgow, UK,
          <year>August</year>
          23-
          <issue>28</issue>
          ,
          <year>2020</year>
          , Proceedings,
          <source>Part XVI 16</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Demarty</surname>
            ,
            <given-names>N. Q.</given-names>
          </string-name>
          <string-name>
            <surname>Duong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Engilberge</surname>
          </string-name>
          ,
          <article-title>Videomem: Constructing, analyzing, predicting short-term and long-term video memorability</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2531</fpage>
          -
          <lpage>2540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Aimultimedialab at mediaeval 2022:
          <article-title>Predicting media memorability using video vision transformers and augmented memorable moments</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lučić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Vivit: A video vision transformer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6836</fpage>
          -
          <lpage>6846</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Jitaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <article-title>Assessing the dificulty of predicting media memorability</article-title>
          ,
          <source>in: 20th International Conference on Content-based Multimedia Indexing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>188</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>An iris image quality assessment method based on laplacian of gaussian operation</article-title>
          .,
          <source>in: MVA</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <article-title>The design of high-level features for photo quality assessment</article-title>
          ,
          <source>in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Farnebäck</surname>
          </string-name>
          ,
          <article-title>Two-frame motion estimation based on polynomial expansion</article-title>
          ,
          <source>in: Image Analysis: 13th Scandinavian Conference</source>
          , SCIA 2003 Halmstad, Sweden, June 29-July 2,
          <source>2003 Proceedings 13</source>
          , Springer,
          <year>2003</year>
          , pp.
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>