<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Aesthetics and Action Recognition-based Networks for the Prediction of Media Memorability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihai Gabriel Constantin</string-name>
          <email>mgconstantin@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen Kang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Dinu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frédéric Dufaux</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Valenzise</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAMPUS, University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratoire des Signaux et Systèmes, Université Paris-Sud-CNRS-CentraleSupélec, Université Paris-Saclay</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>In this working note paper we present the contribution and results of the participation of the UPB-L2S team in the MediaEval 2019 Predicting Media Memorability Task. The task requires participants to develop machine learning systems able to automatically predict whether a video will be memorable for the viewer, and for how long (e.g., hours or days). To solve the task, we investigated several aesthetics and action recognition-based deep neural networks, either by fine-tuning models or by using them as pre-trained feature extractors. Results from different systems were aggregated in various fusion schemes. Experimental results are positive, showing the potential of transfer learning for this task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Media memorability has been studied extensively in recent years,
playing an important role in the analysis of human perception and
understanding of media content. This domain has been approached by
numerous researchers from different perspectives and fields of study,
including psychology [
        <xref ref-type="bibr" rid="ref1 ref13">1, 13</xref>
        ] and computer vision [
        <xref ref-type="bibr" rid="ref12 ref3">3, 12</xref>
        ], while
several works analyzed the correlation between memorability and
other visual perception concepts like interestingness and
aesthetics [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. In this context, the MediaEval 2019 Predicting Media
Memorability task requires participants to create systems that can
predict the short-term and long-term memorability of a set of
soundless videos. The dataset, annotation protocol, precomputed features,
and ground truth data are described in the task overview paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        For our approach, we used several deep neural network models
based on image aesthetics and action recognition. For the first
category, we fine-tuned the aesthetic deep model presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It
is based on the ResNet-101 architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For the action
recognition networks, we used features extracted from the I3D [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
TSN [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] networks and attempted to augment these features with
the C3D features provided by the task organizers. Finally, we
performed late fusion experiments to further improve the results
of these individual runs. Figure 1 summarizes these
approaches, which are detailed in the following sections.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Aesthetics networks</title>
      <p>[Figure 1: diagram of the proposed approaches. Recoverable labels: Fine-Tuning; Feature Extraction (I3D, TSN, and C3D features); PCA; SVR; LF (late fusion); Run 1 to Run 5.]</p>
      <p>
        The aesthetic-based approach modifies the ResNet-101
architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], trained on the AVA dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for the prediction of image
aesthetic value, following the approach described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This
approach yields a deep neural model that assesses single-image
aesthetics and must be fine-tuned to predict the short-term and
long-term memorability of videos. To generate a training dataset that
supports the fine-tuning process, we extracted key frames in
two ways: (i) key frames from the 4th, 5th, and 6th second of each
sample; (ii) one key frame every two seconds, to test multi-frame
training. In the retraining stage of the network for the memorability
task, the provided devset was randomly split into three parts, with
65% of the samples representing the training set, 25% the test set,
and 10% the validation set. We adapted the last layer for this task
by creating a fully connected layer with 2,048 inputs and 1 output.
During the fine-tuning process, we used mean squared error as the
loss function, with an initial learning rate of 0.0001. We ran the
training process for 15 epochs, with a batch size of 32.
      </p>
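      <p>As an illustration of the data preparation described above, the following minimal Python sketch lists the key-frame timestamps for the two extraction schemes and performs the random 65%/25%/10% split. The function names, the fixed random seed, the choice to split at video level, and the convention that the multi-frame scheme starts at second 0 are illustrative assumptions; the text does not specify them.</p>
      <preformat><![CDATA[
```python
import random


def key_frame_times(duration_s, scheme):
    """Return the timestamps (in seconds) at which key frames are taken.

    Assumption: "the 4th, 5th, and 6th second" is read as t = 4, 5, 6.
    """
    if scheme == "middle":
        # Scheme (i): key frames from the 4th, 5th, and 6th second.
        return [t for t in (4, 5, 6) if t <= duration_s]
    if scheme == "multi":
        # Scheme (ii): one key frame every two seconds (assumed to start at 0).
        return list(range(0, int(duration_s) + 1, 2))
    raise ValueError(f"unknown scheme: {scheme}")


def split_devset(video_ids, seed=0):
    """Randomly split ids into 65% training, 25% test, 10% validation."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(0.65 * len(ids))
    n_test = round(0.25 * len(ids))
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]  # remaining ~10%
    return train, test, val


# The devset contains 8,000 videos; ids here are just placeholders.
train_ids, test_ids, val_ids = split_devset(range(8000))
```
]]></preformat>
      <p>Splitting by video identifier rather than by frame keeps all key frames of one video in the same part, which avoids near-duplicate frames leaking between training and evaluation data; this is a design choice of the sketch, not a documented property of our runs.</p>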
    </sec>
    <sec id="sec-4">
      <title>Action recognition networks</title>
      <p>
        Apart from the precomputed C3D features, we extracted the "Mixed_5"
layer from the I3D network [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], trained on the Kinetics dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and the "Inception_5" layer of the TSN network [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], trained on
the UCF101 dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. These features were used as inputs for a
Support Vector Regression (SVR) algorithm that generates the final
memorability scores. We conducted preliminary early fusion tests with
these features in order to select the best combinations, testing each
feature vector individually as well as all possible combinations of
two feature vectors. We also applied PCA dimensionality reduction,
reducing the size of each vector to 128 elements. Finally, to train
the SVR system, we used a random 4-fold approach, with 75% of the data
representing the training set and 25% representing the validation set.
We tuned the SVR model, which uses an RBF kernel, by performing a grid
search over two parameters: the C parameter and the gamma parameter
(taking values 10<sup>k</sup>, where k ∈ [−4, ..., 4]).
      </p>
      <p>
We employed several late fusion schemes on the best performing
systems, trying to benefit from their combined strengths. We used
three different strategies for combining the scores, namely: (i)
LFMax, where we took the maximum score for each media sample;
(ii) LFMin, where we took the minimum score; (iii) LFWeight, where
each score was multiplied by a weight w.
We assigned the weights according to the formula
w = 1 − r/c, where the rank r has the value 0 for the best performing
system, 1 for the second best, and so on, and c is a coefficient
that dictates the influence of the rank on the weights.
      </p>
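      <p>The three late fusion strategies can be sketched in a few lines of Python. The function name and the input layout (one list of scores per system, ordered from best to worst) are hypothetical. In addition, the text only states that each score is multiplied by w = 1 − r/c; summing the weighted scores and dividing by the sum of the weights is an assumption of this sketch.</p>
      <preformat><![CDATA[
```python
def late_fusion(scores_by_system, strategy, c=5.0):
    """Fuse per-sample scores coming from several systems.

    scores_by_system: list of equally long score lists, ordered from the
    best performing system (rank 0) to the worst performing one.
    """
    per_sample = list(zip(*scores_by_system))
    if strategy == "LFMax":
        return [max(s) for s in per_sample]
    if strategy == "LFMin":
        return [min(s) for s in per_sample]
    if strategy == "LFWeight":
        # w = 1 - r / c, with r = 0 for the best performing system.
        weights = [1.0 - r / c for r in range(len(scores_by_system))]
        total = sum(weights)
        # Assumption: weighted scores are summed and normalized by the
        # total weight so that the result stays in the original range.
        return [sum(w * x for w, x in zip(weights, s)) / total
                for s in per_sample]
    raise ValueError(f"unknown strategy: {strategy}")


# Two hypothetical systems scoring two media samples.
best = [0.9, 0.5]
second = [0.7, 0.8]
fused_max = late_fusion([best, second], "LFMax")
fused_min = late_fusion([best, second], "LFMin")
fused_weighted = late_fusion([best, second], "LFWeight", c=5.0)
```
]]></preformat>
      <p>Note that the weights remain positive only while c is larger than the rank of the last fused system, so c should be chosen with the number of systems in mind.</p>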
    </sec>
    <sec id="sec-5">
      <title>EXPERIMENTAL RESULTS</title>
      <p>The development dataset consists of 8,000 videos, annotated with
short-term and long-term memorability scores, while the test dataset consists
of 2,000 videos. The official metric used in the task is Spearman’s
rank correlation (ρ). The best performing systems in the
development phase were selected, retrained on the whole devset using
the optimal parameters, and lastly run on the testset data.</p>
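      <p>For reference, Spearman's rank correlation can be computed as the Pearson correlation of the rank-transformed scores. The sketch below is a plain Python illustration with hypothetical function names; it is not the official evaluation script of the task, and it assigns tied values their average rank.</p>
      <preformat><![CDATA[
```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ground-truth and predicted scores for four videos.
rho = spearman_rho([0.1, 0.2, 0.3, 0.4], [0.15, 0.35, 0.25, 0.45])
```
]]></preformat>
      <p>Because only the ordering of the scores matters, a system can obtain a high ρ even when its predicted values are systematically shifted or scaled with respect to the ground truth.</p>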
    </sec>
    <sec id="sec-6">
      <title>Results on the devset</title>
      <p>During the tests performed on the devset, several systems and
combinations of parameters stood out as best performers. Table 1
shows the performances recorded by the best performing aesthetic,
action-based, and late fusion systems.</p>
      <p>We used several dataset variations in retraining the
aesthetic-based deep network. For short-term memorability, the best
performing systems were the ones trained with key frames extracted
from the 5th second and the ones trained with the multi-frame
approach; both reached a similar Spearman’s ρ of 0.45. For the
long-term memorability subtask, on the other hand, the best
performing systems were the ones trained with key frames from the 5th
second. Although this may seem somewhat surprising, given that
bigger datasets usually lead to better results, we believe that
the reason is that each video contains only one scene.
Therefore, extracting more frames gives the system little additional
information, because the frames are very similar.
However, we would also like to point out that the results for the
other frame extraction schemes were not much lower than these.</p>
      <p>Regarding the 3D action recognition-based systems, we noticed
that individual systems, based on only one feature vector (TSN,
I3D, or C3D), had a low performance, with a Spearman’s ρ score
under 0.42. This performance dropped further when we used
the original vectors, without applying PCA reduction, which
demonstrates the positive influence that dimensionality reduction
has on the final results. We therefore decided to apply an early
fusion scheme, in which we tested all the possible combinations of
the feature vectors by concatenating them. The best performing
combinations were TSN + I3D and C3D + I3D.</p>
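      <p>The search space behind these early fusion experiments is small enough to enumerate explicitly. The following Python sketch lists the single-feature and pairwise candidates, shows early fusion as plain concatenation, and builds the grid of C and gamma values of the form 10^k for k from −4 to 4. All function names are hypothetical, and applying PCA to each vector before concatenation (so that a fused pair has 256 elements) is an assumption about the order of operations.</p>
      <preformat><![CDATA[
```python
from itertools import combinations, product

FEATURES = ("C3D", "I3D", "TSN")


def early_fusion_candidates(names=FEATURES):
    """Each feature alone, plus every combination of two features."""
    singles = [(name,) for name in names]
    pairs = list(combinations(names, 2))
    return singles + pairs


def concatenate(vectors):
    """Early fusion: join per-feature vectors into a single descriptor."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def svr_grid(k_min=-4, k_max=4):
    """All (C, gamma) pairs with both values of the form 10**k."""
    values = [10.0 ** k for k in range(k_min, k_max + 1)]
    return list(product(values, values))


candidates = early_fusion_candidates()
grid = svr_grid()
# Two PCA-reduced vectors of 128 elements each (placeholder zeros).
fused = concatenate([[0.0] * 128, [0.0] * 128])
```
]]></preformat>
      <p>With three feature types this gives six candidate inputs and 81 parameter pairs per candidate, which is cheap to evaluate exhaustively in a 4-fold setting.</p>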
      <p>
        Finally, in the late fusion part of the experiment, we
tested late fusion schemes between the two action recognition-based
systems and between the best performing action recognition-based
system (TSN + I3D) and the aesthetic-based system. In general,
the LFMin systems underperformed, while the
LFMax systems were better than their components, but without
bringing a significant increase in results. The best performing late
fusion schemes proved to be based on LFWeight, more precisely
using a c value of 5. This was an expected result, as it confirms
some of our previous work in other MediaEval tasks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Results on the testset</title>
      <p>For the final phase, we retrained all the systems on the entire set of
videos from the devset, using the parameters computed in the previous
phases, and tested them on the videos from the testset. Table 1
also presents the results for this phase.</p>
      <p>As expected, the best performance comes from a late fusion
system using both aesthetic and action-based components
(short-term ρ = 0.477 and long-term ρ = 0.232). Generally, we observe
that the ranking of the submitted systems is consistent with
the one we observed during the development phase. However, the
results are lower than those recorded on the devset, with significant drops
in performance for the aesthetic-based system and the action-based
(C3D + I3D) approaches. In terms of single-system performance,
the action-based TSN + I3D system performs best, followed by the
aesthetic-based system.</p>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSIONS</title>
      <p>In this paper we presented the UPB-L2S approach for predicting
media memorability at MediaEval 2019. We created a framework that
uses aesthetic and action recognition-based systems, together with late
fusion combinations of these systems, to predict short-term and
long-term memorability scores for soundless video samples. The
results show that these systems are able to individually predict these
scores, while the best results are achieved via weighted late fusion
schemes. This reinforces the idea of better exploiting transfer learning
for tasks where labeled data are particularly hard to obtain.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by the Romanian Ministry of
Innovation and Research (UEFISCDI, project SPIA-VA, agreement
2SOL/2017, grant PN-III-P2-2.1-SOL-2016-02-0002).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Timothy F</given-names>
            <surname>Brady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Talia</given-names>
            <surname>Konkle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>George A</given-names>
            <surname>Alvarez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Aude</given-names>
            <surname>Oliva</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Visual long-term memory has a massive storage capacity for object details</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>105</volume>
          ,
          <issue>38</issue>
          (
          <year>2008</year>
          ),
          <fpage>14325</fpage>
          -
          <lpage>14329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Joao</given-names>
            <surname>Carreira</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Quo vadis, action recognition? a new model and the kinetics dataset</article-title>
          .
          <source>In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6299</fpage>
          -
          <lpage>6308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Romain</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Engilberge</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>VideoMem: Constructing, Analyzing, Predicting Short-term and Long-term Video Memorability</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          (ICCV).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Mihai Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          , Bogdan Andrei Boteanu, and
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>LAPI at MediaEval 2017 - Predicting Media Interestingness</article-title>
          .
          <source>In MediaEval</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Mihai Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , Xavier Alameda-Pineda, and
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Predicting Media Memorability Task at MediaEval 2019</article-title>
          .
          <source>In Proc. of MediaEval 2019 Workshop</source>
          , Sophia Antipolis, France, Oct. 27-29, 2019.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Mihai Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          , Miriam Redi, Gloria Zen, and
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Computational understanding of visual interestingness beyond semantics: literature survey and analysis of covariates</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>52</volume>
          ,
          <issue>2</issue>
          (
          <year>2019</year>
          ),
          <fpage>25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Phillip</given-names>
            <surname>Isola</surname>
          </string-name>
          , Jianxiong Xiao, Devi Parikh, Antonio Torralba, and
          <string-name>
            <given-names>Aude</given-names>
            <surname>Oliva</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>What makes a photograph memorable?</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>36</volume>
          ,
          <issue>7</issue>
          (
          <year>2013</year>
          ),
          <fpage>1469</fpage>
          -
          <lpage>1482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Chen</given-names>
            <surname>Kang</surname>
          </string-name>
          , Giuseppe Valenzise, and
          <string-name>
            <given-names>Frédéric</given-names>
            <surname>Dufaux</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Predicting Subjectivity in Image Aesthetics Assessment</article-title>
          .
          <source>In IEEE 21st International Workshop on Multimedia Signal Processing</source>
          , 27-29 Sept 2019, Kuala Lumpur, Malaysia.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Will</given-names>
            <surname>Kay</surname>
          </string-name>
          , Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, and others.
          <year>2017</year>
          .
          <article-title>The kinetics human action video dataset</article-title>
          .
          <source>arXiv preprint arXiv:1705.06950</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Naila</given-names>
            <surname>Murray</surname>
          </string-name>
          , Luca Marchesotti, and
          <string-name>
            <given-names>Florent</given-names>
            <surname>Perronnin</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>AVA: A large-scale database for aesthetic visual analysis</article-title>
          .
          <source>In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE</source>
          ,
          <fpage>2408</fpage>
          -
          <lpage>2415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Sumit</given-names>
            <surname>Shekhar</surname>
          </string-name>
          , Dhruv Singal, Harvineet Singh,
          <string-name>
            <given-names>Manav</given-names>
            <surname>Kedia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Akhil</given-names>
            <surname>Shetty</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Show and recall: Learning what makes videos memorable</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          .
          <fpage>2730</fpage>
          -
          <lpage>2739</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Roger N</given-names>
            <surname>Shepard</surname>
          </string-name>
          .
          <year>1967</year>
          .
          <article-title>Recognition memory for words, sentences, and pictures</article-title>
          .
          <source>Journal of verbal Learning and verbal Behavior</source>
          <volume>6</volume>
          ,
          <issue>1</issue>
          (
          <year>1967</year>
          ),
          <fpage>156</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Khurram</given-names>
            <surname>Soomro</surname>
          </string-name>
          , Amir Roshan Zamir, and
          <string-name>
            <given-names>Mubarak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          .
          <source>arXiv preprint arXiv:1212.0402</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Limin</given-names>
            <surname>Wang</surname>
          </string-name>
          , Yuanjun Xiong,
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dahua</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          , and Luc Van Gool.
          <year>2016</year>
          .
          <article-title>Temporal segment networks: Towards good practices for deep action recognition</article-title>
          .
          <source>In European conference on computer vision</source>
          . Springer,
          <fpage>20</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>