<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention and LSTM Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ricardo Kleinlein</string-name>
          <email>ricardo.kleinlein@upm.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Luna-Jiménez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoraida Callejas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Fernández-Martínez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Languages and Computer Systems, University of Granada</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Speech Technology Group, Center for Information Processing and Telecommunications, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper reports on the GTH-UPM team experience in the Predicting Media Memorability task at MediaEval 2020. Teams were asked to predict both short-term and long-term memorability scores, where each score measures how likely a video is to persist in a viewer's memory. Our proposed system relies on a late fusion of the scores predicted by three sequential models, each trained on a different modality: video captions, aural embeddings and visual optical flow-based vectors. Whereas the single-modality models show low or zero Spearman correlation coefficient values, their combination considerably boosts performance over development data, up to 0.2 in the short-term memorability prediction subtask and 0.19 in the long-term subtask. However, performance over test data drops to 0.016 and -0.041, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The improvement in computational capabilities is progressively
allowing researchers to tackle problems long thought to be out of
reach due to the subjective nature of the phenomena involved. A
good example is memorability prediction. The seminal work of Isola
et al. laid the groundwork for later computational modelling of
image memorability [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Since 2018 the Predicting Media
Memorability Challenge, hosted within the MediaEval workshop, has
extended the scope of the original problem to encompass
memorability prediction over multimedia sources of information [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. In
its current edition the goal of the task remains the same as in previous
years, yet the video clips now resemble the short
videos commonly found on social media. Further information can
be found in the challenge description paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Several multimodal late fusion strategies have been proposed
for the image and video memorability prediction problem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Additionally, attention mechanisms have been successfully applied
to problems in which data come naturally in a sequential form [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
In particular, self-attention layers have been shown to boost
performance when tackling the computational modelling of media
memorability [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH AND EXPERIMENTS</title>
      <p>Every video sample in the dataset presents the following sources of
information: between 2 and 5 text captions that roughly describe the
content of the video, the video audio signal and its visual frames. As
stated before, multimodal systems are able to learn modality-wise
data representations, and combine their predictive power in order
to make a final, unique memorability prediction. We hypothesize
that a late fusion scheme will benefit from incorporating a
self-attention mechanism that learns to focus on what is particularly
relevant to a given sample’s prediction.</p>
      <p>
        We propose a system based on the late fusion by a Support
Vector Regressor (SVR) of the predictions made by three
single-modality models whose architecture is depicted in Figure 2. In
all cases the biLSTM encoders have 75 units, with all the learners
sharing the same architecture but trained independently. Prediction
comes from the final sigmoid layer. Learned layers
use a dropout rate fixed at 0.3. For every single-modality
learner the training pipeline holds the same; batch size is set at 128,
with initial learning rate 0.001 and Adam optimizer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Figure 1
shows the general prediction pipeline from these models. Results
shown in this paper are obtained following a 5-fold cross-validation
procedure over the 1000 videos of the development data. Training
is stopped after 5 epochs with no improvement over the Spearman
correlation coefficient, computed over the fold’s validation data.
Experimental results are summarized in Table 1. Next we introduce
in greater detail the feature extraction processing carried out for
every modality.
      </p>
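<p>To make the single-modality architecture concrete, the following minimal NumPy sketch illustrates attention-based pooling over a sequence of encoder states followed by a sigmoid prediction. It is an illustration only, not the code used in our experiments; the weight matrices, the dot-product attention form and the mean pooling are assumptions made for the example.</p>
<preformat>
```python
# Illustrative sketch (not the authors' code): dot-product self-attention
# pooling over a sequence of biLSTM states, followed by a sigmoid output.
# All weight names and shapes here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(states, wq, wk):
    """states: (T, d) sequence of biLSTM outputs (2 x 75 = 150-dim)."""
    q = states @ wq                   # queries, (T, d)
    k = states @ wk                   # keys, (T, d)
    scores = softmax(q @ k.T / np.sqrt(states.shape[1]))  # (T, T) weights
    attended = scores @ states        # (T, d) re-weighted states
    return attended.mean(axis=0)      # pooled clip-level vector, (d,)

def predict_memorability(states, wq, wk, w_out, b_out):
    pooled = self_attention_pool(states, wq, wk)
    logit = pooled @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid keeps the score in (0, 1)

d = 150  # 2 directions x 75 LSTM units
T = 12   # e.g. 12 embedding steps in a clip
states = rng.normal(size=(T, d))
wq, wk = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
w_out, b_out = rng.normal(size=d) * 0.1, 0.0
score = predict_memorability(states, wq, wk, w_out, b_out)
```
</preformat>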
    </sec>
    <sec id="sec-3">
      <title>Text captions</title>
      <p>
        We merge all the captions of a sample into a single one in a
Bag-Of-Words fashion. Afterwards, we extract the lemma of every word in
the text using NLTK’s WordNet-based Lemmatizer [
        <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
        ]. Finally,
the input of the text modality is made by the sequence of fasttext
300-dimensional word embeddings corresponding to every word
in the sample’s BOW-text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. At training time, Gaussian random noise with
μ = 0 and σ = 0.15 is added to the input embeddings in order to
improve learning robustness.
      </p>
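<p>The caption pipeline above can be sketched as follows. This is an illustrative toy version: a 4-dimensional embedding table stands in for the 300-dimensional fastText vectors, and NLTK lemmatization is omitted for brevity.</p>
<preformat>
```python
# Illustrative sketch of the caption pipeline: merge a clip's captions into
# one bag-of-words text and perturb its word embeddings with Gaussian noise
# (mu = 0, sigma = 0.15) at training time. Toy embedding table, not fastText.
import random

random.seed(0)

def merge_captions(captions):
    """Concatenate all captions into a single lowercase bag of words."""
    words = []
    for cap in captions:
        words.extend(cap.lower().split())
    return words

def noisy_embeddings(words, table, sigma=0.15):
    """Look up each word's vector and add N(0, sigma^2) noise per dimension."""
    out = []
    for w in words:
        vec = table.get(w, [0.0] * 4)   # zero vector for out-of-vocabulary words
        out.append([v + random.gauss(0.0, sigma) for v in vec])
    return out

captions = ["A dog runs", "The dog jumps high"]
table = {"a": [0.1] * 4, "dog": [0.5] * 4, "runs": [0.2] * 4}
bow = merge_captions(captions)
seq = noisy_embeddings(bow, table)
```
</preformat>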
    </sec>
    <sec id="sec-4">
      <title>Audio signal</title>
      <p>
        Based on previous experience, we hypothesize that event
detection-oriented embeddings provide a robust basis to study multimedia
perceptual variables such as attention or memorability [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Therefore
we compute aural embeddings using the default VGGish
configuration, pretrained on AudioSet, a large audio event-detection
database [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. That way every video audio signal is defined by a
sequence of 128-dimensional embeddings, each spanning 960 ms
of audio and without overlap between them.
      </p>
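<p>The resulting sequence length follows directly from the clip duration, as this small sketch shows (the 960 ms non-overlapping window is the only assumption taken from the VGGish defaults):</p>
<preformat>
```python
# Illustrative arithmetic for the audio representation: VGGish emits one
# 128-dim embedding per non-overlapping 960 ms window, so a clip's audio
# becomes a short sequence whose length depends only on its duration.
def num_vggish_frames(duration_s, hop_s=0.96):
    """Number of complete, non-overlapping 960 ms windows in a clip."""
    return int(duration_s // hop_s)

# e.g. a 6-second clip yields 6 complete windows
frames = num_vggish_frames(6.0)
```
</preformat>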
    </sec>
    <sec id="sec-5">
      <title>Video image</title>
      <p>
        Videos in the dataset are no longer than a few seconds, characterized
by an event that happens quickly and constitutes the most relevant
part of the clip. Because of that, videos are expected to display
quick changes in pixel values between consecutive frames as
visual events take place. In order to capture the degree of visual
change along a clip, we compute optical flow feature maps for its frames,
extracted at 3 FPS, using a LiteFlowNet model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We further
reduce optical flow features’ dimensionality by projecting them
into a 128-dimensional subspace computed by PCA [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. A sample
is represented by a temporally-sorted sequence of 128-dimensional
features that retains most of the information in the optical
flow feature maps.
      </p>
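<p>The PCA projection step can be sketched as follows; the feature dimensions here are toy values (the real system projects LiteFlowNet maps into 128 dimensions):</p>
<preformat>
```python
# Illustrative sketch of the dimensionality-reduction step: project flattened
# optical-flow feature maps onto their top principal components via SVD.
# Sizes here are toy values, not the real LiteFlowNet map sizes.
import numpy as np

def pca_project(x, n_components):
    """x: (n_samples, n_features). Returns (n_samples, n_components)."""
    centered = x - x.mean(axis=0)
    # rows of vt are principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
flow_feats = rng.normal(size=(30, 512))   # 30 frames at 3 FPS, 512-dim maps
reduced = pca_project(flow_feats, 8)      # 8-dim toy subspace (paper uses 128)
```
</preformat>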
    </sec>
    <sec id="sec-6">
      <title>Ensemble of modality-wise models</title>
      <p>We independently train single-modality models from the features
explained in the sections above. Thereafter, a memorability
prediction is computed for every sample in the dataset. The combination
of the three memorability scores is the input for a SVR that makes
a final prediction that reflects the knowledge extracted from the
different modalities.</p>
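<p>The fusion step can be sketched as follows. For a dependency-free illustration, an ordinary least-squares regressor stands in for the SVR actually used, and all numbers are synthetic:</p>
<preformat>
```python
# Sketch of the late-fusion step: the three modality-wise memorability
# predictions form a 3-dim input to a final regressor. Ordinary least
# squares is used here in place of the SVR of the actual system.
import numpy as np

rng = np.random.default_rng(0)

# toy per-modality predictions for 100 clips: text, audio, optical flow
preds = rng.uniform(0.3, 1.0, size=(100, 3))
target = preds @ np.array([0.5, 0.3, 0.2])  # synthetic ground-truth scores

X = np.hstack([preds, np.ones((100, 1))])   # add a bias column
w, *_ = np.linalg.lstsq(X, target, rcond=None)
fused = X @ w                               # final fused memorability scores
```
</preformat>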
      <p>As can be seen from Table 1, individual learners are not able
to fully characterize a video sample and learn the relationship with
its memorability score. However, the ensemble of the three of them
achieves a Spearman correlation coefficient value of 0.2 in the
short-term problem and 0.19 in the long-term one over development data.</p>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>Despite individual learners showing very low or even zero
coefficient values, an SVR based on their posteriors seems to weakly
capture the relationship between media content and its
memorability score, with similar correlation values obtained in both the
short-term and long-term subtasks. This might be partially caused by the
limited amount of data available, which is likely hampering
the learning process and leading the SVR to learn the
development dataset’s score distribution. The distribution of predictions
suggests that the system might be learning to approximate every
sample to the mean memorability score, rather than exploiting the
knowledge extracted from the computed features. Future work
includes extending the amount of training data with similar datasets.
It is also left for future studies to explore different data encodings,
with special emphasis on smaller, more compact data
representations that might be better suited for cases where large datasets are not
available.</p>
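<p>For reference, the Spearman rank correlation used as the task metric can be computed as below; a predictor that preserves the ground-truth ranking scores 1.0, which is exactly the signal a mean-collapsed predictor loses. All values here are synthetic.</p>
<preformat>
```python
# Pure-Python Spearman rank correlation (no ties assumed): rank both
# score lists, then take the Pearson correlation of the ranks.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

truth = [0.95, 0.80, 0.70, 0.85, 0.60]
good = [0.90, 0.78, 0.71, 0.80, 0.65]   # preserves the ground-truth ranking
rho_good = spearman(truth, good)
```
</preformat>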
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>The work leading to these results has been supported by the
Spanish Ministry of Economy, Industry and Competitiveness through
CAVIAR (MINECO, TEC2017-84593-C2-1-R) and AMIC (MINECO,
TIN2017-85854-C4-4-R) projects (AEI/FEDER, UE). Ricardo
Kleinlein’s research was supported by the Spanish Ministry of Education
(FPI grant PRE2018-083225).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Ewan Klein, and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Natural Language Processing with Python</article-title>
          .
          <source>O'Reilly Media.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Romain</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
          <string-name>
            <surname>Claire-Hélène</surname>
            <given-names>Demarty</given-names>
          </string-name>
          , Ngoc Duong, Mats Sjöberg, Bogdan Ionescu,
          <string-name>
            <surname>Thanh-Toan Do</surname>
          </string-name>
          , and France Rennes.
          <year>2018</year>
          .
          <article-title>MediaEval 2018: Predicting Media Memorability Task</article-title>
          . (
          <year>2018</year>
          ).
          arXiv:cs.CV/1807.01052
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Mihai-Gabriel</surname>
            <given-names>Constantin</given-names>
          </string-name>
          , Bogdan Ionescu,
          <string-name>
            <surname>Claire-Hélène</surname>
            <given-names>Demarty</given-names>
          </string-name>
          , Ngoc Duong, Xavier Alameda-Pineda, and
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The Predicting Media Memorability Task at MediaEval</article-title>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Gabriel</surname>
          </string-name>
          <string-name>
            <surname>Constantin</surname>
          </string-name>
          , Chen Kang, Gabriela Dinu, Frédéric Dufaux, Giuseppe Valenzise, and
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Using Aesthetics and Action Recognition-Based Networks for the Prediction of Media Memorability</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2019 Workshop</source>
          , Sophia Antipolis, France,
          <fpage>27</fpage>
          -
          <issue>30</issue>
          <year>October 2019</year>
          (CEUR Workshop Proceedings),
          <source>Martha A. Larson</source>
          , Steven Alexander Hicks, Mihai Gabriel Constantin, Benjamin Bischke, Alastair Porter,
          <string-name>
            <given-names>Peijian</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mathias</given-names>
            <surname>Lux</surname>
          </string-name>
          , Laura Cabrera Quiros,
          <string-name>
            <surname>Jordan Calandre</surname>
          </string-name>
          , and Gareth Jones (Eds.), Vol.
          <volume>2670</volume>
          .
          <article-title>CEUR-WS.org</article-title>
          . http://ceur-ws.org/Vol-2670/MediaEval_19_paper_60.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jiri</given-names>
            <surname>Fajtl</surname>
          </string-name>
          , Vasileios Argyriou, Dorothy Monekosso, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Remagnino</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>AMNet: Memorability Estimation with Attention</article-title>
          .
          (
          <year>2018</year>
          ). arXiv:cs.AI/1804.03115
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Alba</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          , Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin,
          <string-name>
            <surname>Claire-Hélène</surname>
            <given-names>Demarty</given-names>
          </string-name>
          , Faiyaz Doctor, Bogdan Ionescu,
          <string-name>
            <given-names>and Alan F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a Video Memorable?</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2020 Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Gemmeke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. W.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansen</surname>
          </string-name>
          , W. Lawrence,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plakal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ritter</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Audio Set: An ontology and human-labeled dataset for audio events</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          .
          <volume>776</volume>
          -
          <fpage>780</fpage>
          . https://doi.org/10.1109/ICASSP.2017.7952261
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Shawn</given-names>
            <surname>Hershey</surname>
          </string-name>
          , Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen,
          <string-name>
            <given-names>R. Channing</given-names>
            <surname>Moore</surname>
          </string-name>
          , Manoj Plakal, Devin Platt, Rif A.
          <string-name>
            <surname>Saurous</surname>
            , Bryan Seybold, Malcolm Slaney,
            <given-names>Ron J.</given-names>
          </string-name>
          <string-name>
            <surname>Weiss</surname>
          </string-name>
          , and Kevin Wilson.
          <year>2017</year>
          .
          <article-title>CNN Architectures for Large-Scale Audio Classification</article-title>
          . (
          <year>2017</year>
          ).
          <source>arXiv:cs.SD/1609.09430</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Tak-Wai</surname>
            <given-names>Hui</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          , and Chen Change Loy.
          <year>2018</year>
          .
          <article-title>LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation</article-title>
          .
          <source>In IEEE Conference on Computer Vision</source>
          and Pattern Recognition.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Phillip</surname>
            <given-names>Isola</given-names>
          </string-name>
          , Jianxiong Xiao, Devi Parikh, Antonio Torralba, and
          <string-name>
            <given-names>Aude</given-names>
            <surname>Oliva</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>What makes a photograph memorable? Pattern Analysis and Machine Intelligence</article-title>
          ,
          <source>IEEE Transactions on 36, 7</source>
          (
          <year>2014</year>
          ),
          <fpage>1469</fpage>
          -
          <lpage>1482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
            and
            <given-names>Jimmy</given-names>
          </string-name>
          <string-name>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          . (
          <year>2017</year>
          ).
          <source>arXiv:cs.LG/1412.6980</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Ricardo</surname>
            <given-names>Kleinlein</given-names>
          </string-name>
          , Cristina Luna Jiménez, Juan Manuel Montero, Zoraida Callejas, and
          <string-name>
            <surname>Fernando</surname>
          </string-name>
          Fernández-Martínez.
          <year>2019</year>
          .
          <article-title>Predicting Group-Level Skin Attention to Short Movies from Audio-Based LSTM-Mixture of Experts Models</article-title>
          .
          <source>In Proc. Interspeech</source>
          <year>2019</year>
          .
          <fpage>61</fpage>
          -
          <lpage>65</lpage>
          . https://doi.org/10.21437/Interspeech.2019-2799
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>George</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Miller</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>WordNet: A Lexical Database for English</article-title>
          .
          <source>COMMUNICATIONS OF THE ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          ),
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Karl</given-names>
            <surname>Pearson</surname>
          </string-name>
          .
          <year>1901</year>
          . LIII.
          <article-title>On lines and planes of closest fit to systems of points in space</article-title>
          .
          <source>The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science</source>
          <volume>2</volume>
          ,
          <issue>11</issue>
          (
          <year>1901</year>
          ),
          <fpage>559</fpage>
          -
          <lpage>572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          . Attention Is All You Need. (
          <year>2017</year>
          ).
          <source>arXiv:cs.CL/1706.03762</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>