<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dublin's Participation in the Predicting Media Memorability Task at MediaEval 2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alan F. Smeaton</string-name>
          <email>alan.smeaton@dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Owen Corrigan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Dockree</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cathal Gurrin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Graham Healy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Feiyan Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin McGuinness</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Mohedano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomás Ward</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, Dublin City University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Psychology and Trinity College Institute of Neuroscience, Trinity College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>This paper outlines 6 approaches taken to computing video memorability, for the MediaEval Predicting Media Memorability Task. The approaches are based on video features, an end-to-end approach, saliency, aesthetics, neural feedback, and an ensemble of all approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In our work we explore theories from psychology and
neuroaesthetics that may guide predictors for the memorability of visual
media. There are two caveats. The first is that most of the ideas from
neuroaesthetics come from the perception of visual art or artificial
experimental stimuli rather than real-life scenes, so these ideas might
not translate. The second is that, over and above the aesthetics of the video
or keyframes, we cannot control for the semantic content or the
emotional salience of the imagery for the viewer, just as we cannot
control for the viewer’s attention or concentration while initially
viewing or subsequently trying to remember the video.</p>
      <p>
        Our first principle is the idea that aesthetically pleasing features
are driven by Gestalt principles [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] including grouping, symmetry
and lines of good continuation. In each case, items in a scene are
bound together into coherent groups or continuous unbroken forms
by our visual system. According to Ramachandran [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], these Gestalt
principles are driven by neural mechanisms in our perceptual
system that trigger the brain’s reward system so that our attention is
reflexively drawn to these features. There is also some evidence
that grouping of visual features not only increases attention but
also benefits visual working memory [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Our second principle, and in opposition to processing a coherent
whole, is that images that show distinctive figure/ground
arrangements may also capture attention thus promoting memorability. So,
another of Ramachandran’s laws of neuroaesthetics is “isolation” in
which a key visual feature has exaggerated importance and stands
out from the surrounding information [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Although these aesthetic features are intrinsic qualities in images
that capture attention, it is less clear how they affect
memorability. However, superior attention based on these qualities should
increase encoding of the videos and hence improve memorability.
Thus a key prediction based on these principles is that a U-shaped
relationship should emerge, in which the most globally coherent
video images and the most locally distinctive images should both be
more memorable than the video frames that fall in between
these extremes, i.e. those that are neither particularly globally
coherent nor locally distinctive.</p>
      <p>
        The work in this paper was carried out in the context of the
2018 MediaEval Predicting Media Memorability task and we refer
the reader to the task description for prior art [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>RUNS SUBMITTED</title>
    </sec>
    <sec id="sec-4">
      <title>Machine Learning with Pre Computed Features</title>
      <p>
        In this run, we evaluated the performance of a neural network
trained on the precomputed features provided by the task organisers.
These features include C3D features, HMP, HOG descriptors and
more. The complete list can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To merge these different
features, we simply flattened them into one long vector. Using this
as an input, we trained a multilayer perceptron to
output a probability. We tested a number of architectures and found
that using 3 layers performed best.
      </p>
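      <p>As a rough illustration of the flatten-and-concatenate step followed by a small multilayer perceptron, the sketch below runs a single forward pass in NumPy. The feature shapes, hidden layer widths, random weights, ReLU activations and sigmoid output are all assumptions made for the example and are not taken from the submitted run; "3 layers" is read here as three weight layers, and training is omitted.</p>
      <preformat>
```python
import numpy as np

def flatten_and_concat(features):
    """Flatten each per-video feature array and join them into one long vector."""
    return np.concatenate([np.asarray(f, dtype=float).ravel() for f in features])

def init_mlp(input_dim, hidden_dims, rng):
    """Create (weights, bias) pairs for the hidden layers plus one output unit."""
    dims = [input_dim, *hidden_dims, 1]
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def mlp_forward(x, params):
    """ReLU hidden layers followed by a sigmoid output between 0 and 1."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

rng = np.random.default_rng(0)
# Hypothetical stand-ins for three precomputed descriptors of one video.
c3d = rng.normal(size=(101,))
hmp = rng.normal(size=(6, 10))
hog = rng.normal(size=(3, 4, 5))

x = flatten_and_concat([c3d, hmp, hog])                   # one long input vector
params = init_mlp(x.size, hidden_dims=[64, 32], rng=rng)  # three weight layers in total
score = mlp_forward(x, params)                            # untrained output, for shape checking only
```
</preformat>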
    </sec>
    <sec id="sec-5">
      <title>An End-to-end System</title>
      <p>
        For our end-to-end system we used 3 keyframe images from the raw
videos as inputs. At each epoch, we selected one frame randomly
from the video as a form of data augmentation. For the architecture,
we tried two standard models: VGG16 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ResNet18 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We
modified these networks by changing the output to target a single
variable, memorability, instead of a vector of class probabilities. We
also investigated using different numbers of dense layers after the
convolutional layers. Surprisingly, we found that using a single
layer with VGG16 gave the best results. Our loss function was
mean squared error, and we used a gradient descent optimizer.
      </p>
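      <p>The per-epoch frame sampling and the mean squared error loss can be sketched as follows. This is a minimal illustration using only the Python standard library; the file names, the seed and the example predictions are hypothetical, and the network itself is left out.</p>
      <preformat>
```python
import random

def sample_epoch(video_frames, rng):
    """Pick one keyframe per video for this epoch (a simple form of augmentation)."""
    return {video: rng.choice(frames) for video, frames in video_frames.items()}

def mse(predictions, targets):
    """Mean squared error between predicted and annotated memorability."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical file names: three keyframes for each of two videos.
video_frames = {
    "video_a": ["a_0.jpg", "a_1.jpg", "a_2.jpg"],
    "video_b": ["b_0.jpg", "b_1.jpg", "b_2.jpg"],
}

rng = random.Random(0)
for epoch in range(3):
    batch = sample_epoch(video_frames, rng)  # frames that would be fed to the network

loss = mse([0.8, 0.6], [0.9, 0.5])  # made-up predictions against made-up targets
```
</preformat>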
    </sec>
    <sec id="sec-6">
      <title>Using Video and Image Saliency</title>
      <p>
        Visual saliency models generate a probability map highlighting
image regions that most attract human attention. Here, this
information is explored for the task of predicting media memorability.
More precisely, a saliency map for each frame of video is computed
with the SalGAN model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The maps are used to spatially weight the activations of the last
convolutional layer of Inception-v3 pre-trained on ImageNet. For
that, video frames are resized to 300 × 300 resolution and forwarded
to Inception-v3 to generate convolutional volumes of 7 × 7 × 2048
(the first two dimensions correspond to the spatial resolution, and
the last one to the number of channels or depth of the layer).</p>
      <p>Saliency maps are downsized to 7 × 7, normalised to contain
values between 0 and 1, and multiplied element-wise with the
convolutional activations. Global average pooling is then applied over the
spatial dimensions of each channel to obtain a final representation of
2048 dimensions. The hypothesis here is that the denser the saliency map,
the more human attention the images draw, and consequently the more memorable
they may be.</p>
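      <p>The weighting and pooling steps can be sketched in NumPy as below. Random arrays stand in for the SalGAN map and the Inception-v3 activations, block averaging is assumed as the downsizing method, and min-max scaling is assumed as the normalisation; none of these choices is confirmed by the description above. The pooling averages over the 7 × 7 spatial grid so that one value per channel remains.</p>
      <preformat>
```python
import numpy as np

def downsample_mean(saliency, out_size):
    """Block-average a square map to out_size x out_size (side must divide evenly)."""
    block = saliency.shape[0] // out_size
    return saliency.reshape(out_size, block, out_size, block).mean(axis=(1, 3))

def min_max_normalise(x, eps=1e-8):
    """Rescale values to lie between 0 and 1."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def saliency_weighted_descriptor(activations, saliency):
    """Weight an (H, W, C) volume by an (H, W) map, then average over H and W."""
    weighted = activations * saliency[:, :, None]
    return weighted.mean(axis=(0, 1))

rng = np.random.default_rng(0)
saliency_full = rng.random((294, 294))   # stand-in for a SalGAN saliency map
activations = rng.random((7, 7, 2048))   # stand-in for the last convolutional layer

saliency_small = min_max_normalise(downsample_mean(saliency_full, 7))
descriptor = saliency_weighted_descriptor(activations, saliency_small)  # 2048 values
```
</preformat>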
      <p>This 2048-dimensional vector was then fed into a neural network, similar
to how the precomputed features were used in Section 2.1 (Machine Learning
with Pre Computed Features).</p>
    </sec>
    <sec id="sec-7">
      <title>Using a Neural Approach</title>
      <p>In this approach we used human reaction to a second viewing of a
video keyframe to train a classifier for memorability, a true
human-in-the-loop experiment. The middle frame was extracted for each
video clip in the test set and a participant was shown these images
at high speed (4 Hz) on a computer screen while their EEG
(electroencephalography) signals were simultaneously recorded.</p>
      <p>Each of the 2000 images extracted from the test set was presented twice.
Following completion of the first viewing, EEG signals were
band-pass filtered between 0.5 Hz and 10 Hz and re-referenced to a common
average reference, and the mean voltage between 300 ms and 600 ms
following each image presentation was calculated for the Pz channel
(baselined to the interval from -250 ms to 0 ms prior to image presentation). The
participant then viewed the images a second time with similar EEG
data recording and processing, and the values were averaged over the two
presentations of each image to form the submission scores.</p>
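      <p>The per-image score described above can be sketched as follows, assuming the Pz signal has already been band-pass filtered, re-referenced and cut into one epoch per image presentation. The 10 ms sampling step and the synthetic step-shaped epochs are hypothetical and only serve to make the arithmetic easy to follow.</p>
      <preformat>
```python
import numpy as np

def mean_between(signal, times_ms, start_ms, end_ms):
    """Mean of the samples whose timestamps fall within start_ms..end_ms inclusive."""
    mask = np.logical_and(times_ms >= start_ms, end_ms >= times_ms)
    return signal[mask].mean()

def p300_score(epoch, times_ms):
    """Mean 300-600 ms voltage, baselined to the -250 to 0 ms interval."""
    baseline = mean_between(epoch, times_ms, -250, 0)
    return mean_between(epoch, times_ms, 300, 600) - baseline

# Hypothetical Pz epochs sampled every 10 ms from -250 ms to 790 ms.
times_ms = np.arange(-250, 800, 10)
first = np.where(times_ms >= 300, 5.0, 1.0)    # first presentation of one image
second = np.where(times_ms >= 300, 3.0, 1.0)   # second presentation of the same image

# Average the two presentations to obtain the score for this image.
score = (p300_score(first, times_ms) + p300_score(second, times_ms)) / 2.0
```
</preformat>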
      <p>
        These parameters were selected as they are known to correspond
both to a time region and an electrode location in which a P300
event-related potential is typically observed in this type of task where
attention is elicited [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The rationale is that high-amplitude P300
responses correspond to imagery which attracts visual attention and is
thus potentially more memorable, which should also stimulate visual
working memory [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We then computed the Pearson correlation
between the P300 signals and the memorability scores to evaluate
the performance of this feature.
      </p>
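      <p>For reference, the Pearson correlation between a set of P300 amplitudes and memorability scores can be computed as in the sketch below. The five value pairs are invented for illustration and are not data from the experiment.</p>
      <preformat>
```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical values: averaged P300 amplitudes and annotated memorability scores.
p300_amplitudes = [1.2, 0.4, 2.1, 1.7, 0.9]
memorability = [0.81, 0.70, 0.95, 0.88, 0.79]
r = pearson(p300_amplitudes, memorability)
```
</preformat>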
    </sec>
    <sec id="sec-8">
      <title>Computing Visual Aesthetics</title>
      <p>
        A final technique we incorporated was to use our own version of an
image aesthetics classifier as described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], instead of the values
provided by the task organisers. This maps back to our guiding
principles driven by neuroaesthetics, described earlier.
      </p>
    </sec>
    <sec id="sec-9">
      <title>An Ensemble of All Techniques</title>
      <p>In each of the approaches above we made predictions for the entire
training set, as well as for the entire test set, after training had
completed. One limitation to note is that, due to the time-consuming
nature of EEG labelling in the neural approach of Section 2.4, only a subset
of the training dataset (2,000 videos) was used in this ensemble run. We used
the predictions from each of the above approaches and trained a linear
model on this subset of the training data to identify which were the
most important predictors. We then used these weights to combine
the values on the test set, which generated this run.</p>
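      <p>A minimal sketch of such a linear ensemble is given below, assuming ordinary least squares with an intercept. The random prediction matrices, the five-column layout (one column per approach) and the noise-free synthetic targets are hypothetical and chosen only so that the fitted weights can be checked.</p>
      <preformat>
```python
import numpy as np

def fit_linear_ensemble(train_preds, train_targets):
    """Least-squares weights (with an intercept as last entry) over per-run predictions."""
    design = np.column_stack([train_preds, np.ones(len(train_preds))])
    weights, *_ = np.linalg.lstsq(design, train_targets, rcond=None)
    return weights

def apply_ensemble(weights, preds):
    """Combine per-run predictions using previously fitted weights."""
    design = np.column_stack([preds, np.ones(len(preds))])
    return design @ weights

rng = np.random.default_rng(0)
train_preds = rng.random((2000, 5))                # one column per approach
true_w = np.array([0.1, 0.4, 0.2, 0.0, 0.3])       # made-up mixing weights
train_targets = train_preds @ true_w + 0.05        # synthetic, noise-free targets

weights = fit_linear_ensemble(train_preds, train_targets)
test_preds = rng.random((10, 5))
ensemble_scores = apply_ensemble(weights, test_preds)
```
</preformat>
      <p>Under this reading, the magnitude of each fitted weight gives a first indication of how much the corresponding approach contributes to the combined score.</p>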
    </sec>
    <sec id="sec-11">
      <title>RESULTS, CONCLUSIONS AND FUTURE PLANS</title>
      <p>The performance results of our submissions are shown in Table 1
and illustrated in Figure 1.</p>
      <p>The results show that the run based on direct neural/EEG
feedback from the human participant was the worst, as expected.
Part of the reason might be that training was done on
only 2,000 images, with only one participant. It is definitely worth
scaling up this approach to see how it performs with more data.</p>
      <p>The run based on saliency was slightly better than the neural
run, especially for long-term memorability. The ordering by performance
of the provided-features, ensemble and
end-to-end submissions is inconsistent across long-term
vs. short-term memorability and across the metric used, but the
end-to-end run seems to have performed best, which is surprising.</p>
      <p>Overall, our results seem poor, either for the above reason or because
of insufficient tuning of parameter settings in our experiments.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by Science Foundation Ireland
under the SFI Research Centres Programme grant number SFI/12/RC/2289.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.-T.</given-names>
            <surname>Do</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>MediaEval 2018: Predicting Media Memorability</article-title>
          .
          <source>In Proc. of the MediaEval 2018 Workshop</source>
          , Sophia-Antipolis, France. CEUR-WS, Sophia-Antipolis, France,
          <fpage>29</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . IEEE, Las Vegas, United States,
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Graham</given-names>
            <surname>Healy</surname>
          </string-name>
          , Tomas Ward, Cathal Gurrin, and
          <string-name>
            <given-names>Alan F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of NTCIR-13 NAILS Task</article-title>
          .
          <source>In Proceedings of the NTCIR-13 NAILS (Neurally Augmented Image Labelling Strategies)</source>
          . National Institute of Informatics, Tokyo, Japan,
          <fpage>380</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Feiyan</given-names>
            <surname>Hu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alan F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Image Aesthetics and Content in Selecting Memorable Keyframes from Lifelogs</article-title>
          .
          <source>In MultiMedia Modeling - 24th International Conference, MMM, Bangkok, Thailand, February 5-7, 2018, Proceedings, Part I</source>
          . Springer, Bangkok, Thailand,
          <fpage>608</fpage>
          -
          <lpage>619</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Junting</given-names>
            <surname>Pan</surname>
          </string-name>
          , Cristian Canton-Ferrer, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giró-i-Nieto.
          <year>2017</year>
          .
          <article-title>SalGAN: Visual Saliency Prediction with Generative Adversarial Networks</article-title>
          .
          <source>CoRR</source>
          abs/1701.01081 (2017),
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . arXiv:1701.01081 http://arxiv.org/abs/1701.01081
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Dwight J.</given-names>
            <surname>Peterson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marian E.</given-names>
            <surname>Berryhill</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The Gestalt principle of similarity benefits visual working memory</article-title>
          .
          <source>Psychonomic Bulletin &amp; Review</source>
          <volume>20</volume>
          ,
          <issue>6</issue>
          (Dec
          <year>2013</year>
          ),
          <fpage>1282</fpage>
          -
          <lpage>1289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Vilayanur S.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>The tell-tale brain: A neuroscientist's quest for what makes us human</article-title>
          .
          <source>WW Norton &amp; Company, 500 Fifth Avenue</source>
          , New York, New York.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Vilayanur S.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          and
          <string-name>
            <given-names>Diane</given-names>
            <surname>Rogers-Ramachandran</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Reading between the Lines</article-title>
          .
          <source>Scientific American Mind</source>
          <volume>21</volume>
          ,
          <issue>4</issue>
          (
          <year>2010</year>
          ),
          <fpage>18</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>CoRR</source>
          abs/1409.1556 (2014),
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . arXiv:1409.1556 http://arxiv.org/abs/1409.1556
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Todorovic</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Gestalt principles</article-title>
          .
          <source>Scholarpedia</source>
          <volume>3</volume>
          ,
          <issue>12</issue>
          (
          <year>2008</year>
          ),
          5345. revision #91314.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>