<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Multimodality, Perplexity and Explainability for Memorability Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alison Reboud</string-name>
          <email>alison.reboud@eurecom.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ismail Harrando</string-name>
          <email>ismail.harrando@eurecom.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorma Laaksonen</string-name>
          <email>jorma.laaksonen@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes several approaches proposed by the MeMAD Team for the MediaEval 2021 “Predicting Media Memorability” task. Our best approach is based on early fusion of multimodal (visual and textual) features. We also designed one of our runs to be explainable in order to give new insights into the topic of audiovisual content memorability. Finally, one of our runs is an experiment in analysing the potential role played by text perplexity in video content memorability.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>APPROACH</title>
      <p>
        The task and the metrics used for its
evaluation are described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We have experimented in the past with
approaches combining textual and visual features [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] as well as
using visio-linguistic models [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for predicting short- and long-term
media memorability. This year, we explored other methods,
including: i) performing early fusion of multimodal features, ii)
attempting to explain whether some phrases could trigger
memorability or not, and iii) estimating the perplexity of video descriptions.
Our code, which enables reproduction of our approaches, is available at
https://github.com/MeMAD-project/media-memorability.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Early Fusion of Multimodal Features</title>
      <p>Textual features. Our textual approach uses the video
descriptions (or captions) provided by the task organizers. First, we
concatenate the video descriptions to obtain one string for each video.
Then, to get a textual representation of the video content, we
experimented with the following methods:
• Computing TF-IDF, removing rare words (fewer than 4 occurrences) and
stopwords, and accounting for frequent 2-grams.
• Averaging GloVe embeddings for all non-stopword tokens using
the pre-trained 300d version [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
• Averaging BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] token representations (keeping all the words
in the descriptions up to 250 words per sentence).
• Using Sentence-BERT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] sentence representations, in
particular the distilled version that is fine-tuned for the Semantic
Textual Similarity (STS) benchmark<sup>1</sup>.
• Using Sentence-BERT again, with the model fine-tuned on the
Yahoo Answers topics dataset, which comprises questions and answers
from Yahoo Answers classified into 10 topics.</p>
      <p><sup>1</sup> https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens</p>
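      <p>As a rough illustration of the TF-IDF representation described above, the following Python sketch weights unigrams and 2-grams after dropping stopwords and rare terms. The stopword list, the smoothed IDF variant, the function names and the toy descriptions are assumptions made for this example, and the toy corpus uses a lower count threshold than the 4 occurrences mentioned above. It is not the implementation used for the submitted runs.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "in", "on", "of", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def terms(text):
    """Unigrams plus adjacent-word 2-grams of one description."""
    toks = tokenize(text)
    return toks + [f"{a} {b}" for a, b in zip(toks, toks[1:])]

def tfidf(docs, min_count=4):
    """Return (vocabulary, one sparse {term: weight} dict per document)."""
    doc_terms = [terms(d) for d in docs]
    total = Counter(t for ts in doc_terms for t in ts)
    # Keep only terms seen at least `min_count` times in the whole corpus.
    vocab = sorted(t for t, c in total.items() if c >= min_count)
    keep = set(vocab)
    df = Counter(t for ts in doc_terms for t in set(ts) if t in keep)
    n = len(docs)
    vectors = []
    for ts in doc_terms:
        tf = Counter(t for t in ts if t in keep)
        # Smoothed IDF, one common variant among several (an assumption).
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vocab, vectors

# Invented toy descriptions, one concatenated string per video.
docs = [
    "a woman is eating a sandwich",
    "a woman is walking a dog",
    "the woman is eating in the park",
    "a woman is eating and the dog is barking",
]
# The toy corpus is tiny, so the threshold is lowered from 4 to 3 here.
vocab, vectors = tfidf(docs, min_count=3)
```
]]></preformat>
      <p>In this sketch, 2-grams are filtered with the same count threshold as single words, which is only one possible reading of "frequent 2-grams".</p>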
      <p>For each representation, we experimented with multiple regression
models and tuned the hyperparameters using a fixed 6-fold
cross-validation on the training set. For our submission, we used
the Sentence-BERT model fine-tuned on the Yahoo Answers topics dataset.</p>
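      <p>A fixed cross-validation split means that every candidate model is evaluated on exactly the same folds. A minimal sketch of one way to obtain such a split is given below, assuming a seeded shuffle; the seed value, the function names and the number of items are hypothetical and chosen only for illustration.</p>
      <preformat><![CDATA[
```python
import random

def fixed_folds(n_items, n_folds=6, seed=0):
    """Assign item indices to folds once, so every model sees the same splits."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold k takes every n_folds-th shuffled index, starting at offset k.
    return [sorted(indices[k::n_folds]) for k in range(n_folds)]

def train_validation_splits(folds):
    """Yield (train_indices, validation_indices), holding out one fold at a time."""
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), held_out

folds = fixed_folds(n_items=20)
splits = list(train_validation_splits(folds))
```
]]></preformat>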
      <p>
        Visual features. We extracted 2048-dimensional I3D [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] features
to describe the visual content of the videos. The I3D features are
extracted from the Mixed_5c layer of the readily available model
trained with the Kinetics-400 dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These features
outperform those extracted from the 400-dimensional
classification output and the C3D [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] features provided by the task
organizers.
      </p>
      <p>
        Audio features. We used 527-dimensional audio features that
encode the occurrence probabilities of the 527 classes of the Google
AudioSet Ontology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in each video clip. The model uses the
readily available VGGish feature extraction model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Prediction model. In all our early fusion experiments, the
respective features were concatenated to create multimodal input feature
vectors. We used a feed-forward network with one hidden layer
to predict the memorability score. We varied the number of units
in the hidden layer and optimized it together with the number of
training epochs. We used a ReLU non-linearity and dropout between
the layers and a simple sigmoid output for the regression result. The
experiments used the same 6-fold cross-validation on the training
set. The best models typically consisted of 600 units in the hidden
layer and needed 700 training epochs to produce the maximal
Spearman correlation score. We have also experimented with a weighted
average to combine modalities, but early fusion turned out to be
more successful.</p>
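      <p>The prediction model described above can be sketched as a forward pass in plain Python: the modality vectors are concatenated (early fusion), passed through one hidden layer with ReLU and optional dropout, and mapped to a score with a sigmoid. The layer sizes, the weight initialisation, the dropout variant and all names below are assumptions for illustration (the toy sizes stand in for the feature dimensions and hidden units reported above), and training is omitted.</p>
      <preformat><![CDATA[
```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(weights, bias, v):
    """Affine layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def init_layer(rng, n_out, n_in):
    """Small uniform random weights and zero biases (an assumed scheme)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict(text_vec, visual_vec, params, dropout=0.0, rng=None):
    """Early fusion: concatenate the modalities, then apply one hidden layer."""
    fused = list(text_vec) + list(visual_vec)
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(w1, b1, fused))
    if dropout > 0.0 and rng is not None:
        # Inverted dropout, meant to be enabled only during training.
        hidden = [0.0 if rng.random() < dropout else h / (1.0 - dropout)
                  for h in hidden]
    # The sigmoid keeps the predicted memorability score in (0, 1).
    return sigmoid(dense(w2, b2, hidden)[0])

rng = random.Random(0)
text_dim, visual_dim, hidden_units = 8, 16, 4  # toy sizes for illustration
params = (init_layer(rng, hidden_units, text_dim + visual_dim),
          init_layer(rng, 1, hidden_units))
text_vec = [rng.random() for _ in range(text_dim)]
visual_vec = [rng.random() for _ in range(visual_dim)]
score = predict(text_vec, visual_vec, params)
```
]]></preformat>
      <p>Adding a further modality, such as the audio features, would only lengthen the fused vector and the first weight matrix.</p>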
    </sec>
    <sec id="sec-3">
      <title>Exploring Explainability</title>
      <p>We have experimented with different simple text-based models that
offer the possibility to quantify the relation between the caption
and the predicted memorability score in an explainable manner. We
train the models given the specific sub-task and dataset, i.e. for the
short-term memorability predictions, we train the models on the
short-term memorability scores.</p>
      <p>
        We compare simple linear models (regressors) fed with
interpretable input features: bag of words, TF-IDF, and topic
distributions produced by an LDA model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] trained on the corpus of
captions. After evaluating each model/input
feature pair with cross-validation, we obtain the best
results using TF-IDF features with a Linear Support Vector
Regression (LinearSVR<sup>2</sup>). On the one hand, this model allows us to
understand, to some extent, the correspondence between some input words and
the final predicted score. For example, the top word for
raw and normalized short-term memorability on both Memento10K
and TRECVID is "woman". On the other hand, the empirical
performance on both subtasks falls significantly behind other models,
suggesting that memorability is both non-linear and multimodal in
nature.
      </p>
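      <p>To illustrate how a linear model exposes word-level influence, the sketch below fits a plain least-squares regressor on binary bag-of-words features by gradient descent and ranks words by their learned weight. This simple regressor is a stand-in for the TF-IDF and LinearSVR pipeline mentioned above, not a reproduction of it; the captions, the scores, the stopword list and the hyperparameters are invented for the example.</p>
      <preformat><![CDATA[
```python
STOPWORDS = {"a", "an", "the"}  # tiny illustrative list (an assumption)

def bag_of_words(captions):
    """Return (vocabulary, binary word-presence matrix)."""
    tokenized = [[t for t in c.lower().split() if t not in STOPWORDS]
                 for c in captions]
    vocab = sorted({t for toks in tokenized for t in toks})
    rows = [[1.0 if w in toks else 0.0 for w in vocab] for toks in tokenized]
    return vocab, rows

def fit_linear(rows, targets, lr=0.1, steps=2000):
    """Least-squares weights by batch gradient descent on centred targets."""
    mean = sum(targets) / len(targets)
    centred = [t - mean for t in targets]
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        # Residuals are computed once per step, before any weight changes.
        errors = [sum(w * x for w, x in zip(weights, row)) - y
                  for row, y in zip(rows, centred)]
        for j in range(len(weights)):
            grad = sum(e * row[j] for e, row in zip(errors, rows))
            weights[j] -= lr * grad
    return weights

# Invented toy captions and memorability scores, for illustration only.
captions = ["a woman smiles", "a woman runs", "a dog runs", "a car drives"]
scores = [0.9, 0.9, 0.6, 0.5]

vocab, rows = bag_of_words(captions)
weights = fit_linear(rows, scores)
# Words sorted from the most positive to the most negative weight.
ranked = sorted(zip(vocab, weights), key=lambda p: p[1], reverse=True)
```
]]></preformat>
      <p>Reading the two ends of the ranked list gives the words that push the predicted score up or down the most, which is the kind of inspection a linear model makes possible.</p>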
    </sec>
    <sec id="sec-4">
      <title>Exploring Perplexity</title>
      <p>
        It has been suggested that memorable content can be found in
sparse areas of an attribute space [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, images whose
convolutional neural network features are sparsely distributed have
been found to be more memorable [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Additionally, we observe
that the results obtained on the TRECVID dataset (made of short
videos from Vine) are considerably worse than those obtained on
the Memento10K dataset, which may be because the
TRECVID dataset is smaller but also much more diverse. One
hypothesis is that popular Vines break with expectations. Supporting
this hypothesis, we found that videos in the TRECVID dataset
depicting a person eating a car, or a chicken coming out of an egg,
have a high memorability score. Therefore, inspired by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], who
predict the novelty of a caption, we wanted to test the hypothesis
that the novelty of a caption influences its memorability.
      </p>
      <p>We explore the (pseudo-)perplexity of each video description
using a pretrained RoBERTa-large model. The score for each caption
is computed by adding up the log probabilities of each masked token
in the caption, and the aggregation between captions is done with
a max function. We select the caption with the highest perplexity
for each video. All runs have identical scores for each dataset as we
do not use the training set at all in this method.</p>
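      <p>The scoring scheme described above can be sketched as follows: each token is masked in turn, its log-probability is accumulated, and the captions of a video are aggregated with a max. In this sketch the masked language model is replaced by a hypothetical unigram table so that the example runs on its own, and the sum of log-probabilities is normalised by caption length before exponentiation, which is an assumption rather than a detail stated above.</p>
      <preformat><![CDATA[
```python
import math

# Hypothetical stand-in for a masked language model: a fixed unigram table.
# A real run would instead query the model for P(token | rest of the caption).
TOY_PROBS = {"a": 0.2, "person": 0.1, "eating": 0.05, "sandwich": 0.02,
             "car": 0.01}
UNKNOWN_PROB = 0.001

def masked_log_prob(tokens, position):
    """Log-probability of tokens[position] given the other tokens (toy version)."""
    return math.log(TOY_PROBS.get(tokens[position], UNKNOWN_PROB))

def pseudo_perplexity(caption):
    """exp of the negative mean masked log-probability over the caption."""
    tokens = caption.lower().split()
    total = sum(masked_log_prob(tokens, i) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))

def video_score(captions):
    """Aggregate over the captions of one video with a max."""
    return max(pseudo_perplexity(c) for c in captions)

# Invented captions for a single video.
captions = ["a person eating a sandwich", "a person eating a car"]
scores = [pseudo_perplexity(c) for c in captions]
```
]]></preformat>
      <p>With the toy table, the less expected caption receives the higher pseudo-perplexity and therefore determines the score of the video.</p>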
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND DISCUSSION</title>
      <p>Following the task description, we prepared 5 different runs,
defined as follows:
• run1 = Explainable (<xref ref-type="sec" rid="sec-3">Section 1.2</xref>)
• run2 = Early Fusion of Textual+Visual+Audio features
• run3 = Early Fusion of Textual+Visual features
• run4 = Perplexity-based (<xref ref-type="sec" rid="sec-4">Section 1.3</xref>)
• run5 = Early Fusion of Textual+Visual features trained on the
combined (TRECVID + Memento10K) datasets
All models except run1 use exclusively short-term scores for
predicting the long-term score.</p>
      <p>We present in Tables 1 and 2 the final results obtained on the
test sets of the TRECVID and the Memento10K datasets, respectively.
We comment on the Spearman rank correlation scores, as this is the official
evaluation metric. We observe that the early fusion method which
uses short-term scores works best for both short- and long-term
predictions. Adding the audio modality features did not improve
the results. We also observe that the results for long-term
prediction are always worse than those for short-term prediction,
and that the results for Memento10K are always better than those for TRECVID. Combining the
datasets did not yield better results. This is not very surprising for
the Memento10K results since it is the bigger dataset. However, the
fact that augmenting the TRECVID dataset did not lead to a
significant improvement suggests that, beyond the difference in size, there
is a difference in nature between the datasets that leads to poor
generalisation in terms of prediction. This is consistent with
the generalisation subtask, which yields significantly worse results
for both Memento10K and TRECVID. Finally, the scores obtained
with the perplexity run were by far the lowest, only reaching 0.073
for Memento10K when our best run obtained 0.658. With this run,
rather than obtaining the best results, we wanted to evaluate the
potential of adding a caption perplexity measure. At this stage, these
results do not suggest a strong relationship between perplexity and
memorability.</p>
      <p><sup>2</sup> https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html</p>
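      <p>Since the discussion above relies on Spearman's rank correlation, a short self-contained sketch of that metric may be useful: both score lists are converted to ranks (tied values share the mean of their ranks) and the Pearson correlation of the ranks is returned. The function names and the toy scores are illustrative and are not taken from the task's evaluation scripts.</p>
      <preformat><![CDATA[
```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(predicted, actual):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(actual))

# Invented predicted and ground-truth memorability scores for four videos.
predicted = [0.91, 0.75, 0.80, 0.60]
actual = [0.95, 0.70, 0.85, 0.65]
rho = spearman(predicted, actual)
```
]]></preformat>
      <p>Because only the ordering matters, a model can obtain a high Spearman score even when its predicted values are systematically offset from the ground-truth scores.</p>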
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Wilma A.</given-names>
            <surname>Bainbridge</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Shared memories driven by the intrinsic memorability of items. Human Perception of Visual Information: Psychological and Computational Perspectives (</article-title>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          ),
          <fpage>993</fpage>
          --
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>João</given-names>
            <surname>Carreira</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . IEEE,
          <fpage>4724</fpage>
          -
          <lpage>4733</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)</source>
          . ACL, Minneapolis, Minnesota, USA,
          <fpage>4171</fpage>
          --
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jort F.</given-names>
            <surname>Gemmeke</surname>
          </string-name>
          , Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence,
          <string-name>
            <given-names>R Channing</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Manoj</given-names>
            <surname>Plakal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marvin</given-names>
            <surname>Ritter</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Audio set: An ontology and human-labeled dataset for audio events</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . New Orleans, Louisiana, USA,
          <fpage>776</fpage>
          -
          <lpage>780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Shawn</given-names>
            <surname>Hershey</surname>
          </string-name>
          , Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen,
          <string-name>
            <given-names>R. Channing</given-names>
            <surname>Moore</surname>
          </string-name>
          , Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney,
          <string-name>
            <given-names>Ron J.</given-names>
            <surname>Weiss</surname>
          </string-name>
          , and Kevin Wilson.
          <year>2017</year>
          .
          <article-title>CNN Architectures for Large-Scale Audio Classification</article-title>
          . (
          <year>2017</year>
          ).
          <source>arXiv:cs.SD/1609.09430</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Will</given-names>
            <surname>Kay</surname>
          </string-name>
          , Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The Kinetics Human Action Video Dataset</article-title>
          . (
          <year>2017</year>
          ).
          <source>arXiv:cs.CV/1705.06950</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Rukiye Savran</given-names>
            <surname>Kiziltepe</surname>
          </string-name>
          , Mihai Gabriel Constantin,
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          , Graham Healy, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez,
          <string-name>
            <given-names>Alan F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lorin</given-names>
            <surname>Sweeney</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Overview of The MediaEval 2021 Predicting Media Memorability Task</article-title>
          . In Multimedia Benchmark Workshop (MediaEval).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jiří</given-names>
            <surname>Lukavsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Děchtěrenko</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Visual properties and memorising scenes: Effects of image-space sparseness and uniformity</article-title>
          .
          <source>Attention, Perception, &amp; Psychophysics</source>
          <volume>79</volume>
          ,
          <issue>7</issue>
          (
          <year>2017</year>
          ),
          <fpage>2044</fpage>
          -
          <lpage>2054</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Nianzu</given-names>
            <surname>Ma</surname>
          </string-name>
          , Alexander Politowicz, Sahisnu Mazumder, Jiahua Chen, Bing Liu, Eric Robertson, and
          <string-name>
            <given-names>Scott</given-names>
            <surname>Grigsby</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Semantic Novelty Detection in Natural Language Descriptions</article-title>
          .
          <source>In International Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <fpage>866</fpage>
          -
          <lpage>882</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In International Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . ACL, Doha, Qatar,
          <fpage>1532</fpage>
          --
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Alison</given-names>
            <surname>Reboud</surname>
          </string-name>
          , Ismail Harrando, Jorma Laaksonen, Danny Francis, Raphael Troncy, and Hector Laria Mantecon.
          <year>2019</year>
          .
          <article-title>Combining Textual and Visual Modeling for Predicting Media Memorability</article-title>
          .
          <source>In Multimedia Benchmark Workshop (MediaEval) (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2670</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Alison</given-names>
            <surname>Reboud</surname>
          </string-name>
          , Ismail Harrando, Jorma Laaksonen, and
          <string-name>
            <given-names>Raphael</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Predicting Media Memorability with Audio, Video, and Text representations</article-title>
          .
          <source>In Multimedia Benchmark Workshop (MediaEval) (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2882</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          .
          <source>In International Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . ACL, Hong Kong, China,
          <fpage>3982</fpage>
          --
          <lpage>3992</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Du</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lubomir D.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          , Rob Fergus, Lorenzo Torresani, and
          <string-name>
            <given-names>Manohar</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning Spatiotemporal Features with 3D</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>