<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting Memorability via Early Fusion Deep Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aaron Weiss</string-name>
          <email>weissa7@tcnj.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Sang</string-name>
          <email>sangb1@tcnj.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sejong Yoon</string-name>
          <email>yoons@tcnj.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The College of New Jersey</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In this working note, we present our approach to, and investigation of, the MediaEval 2018 Predicting Media Memorability Task. We used a portion of the provided features and also employed additional features. We attempted two different training approaches for a deep neural network architecture that fuses the features we used. Official results, as well as our investigation of the task data, are provided.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>[Fig. 1: network structure (figure image not recovered). Labels that survive extraction: inputs MemNet and Caption; layer sizes 1, 101 → 50, 109 → 50, 100 → 50, and (1000x3) → 1000; Dropout after the layers.]</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        MediaEval 2018 Predicting Media Memorability [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a new
multimedia analysis task following up from previous years of media
interestingness prediction challenges [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It consists of two subtasks.
In the first subtask, the system predicts whether the viewer will
remember a video in the short term (minutes). In the second subtask,
the system predicts whether the viewer will remember
a video in the long term (24-72 hours). Of the 10,000
annotated videos, 8,000 were provided as the
dev-set, and the remaining 2,000 were reserved for the test-set.
Details of the annotation protocol and the prior work survey can
be found in the task overview paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>In this section, we first describe the features we employed and then
present our method.</p>
    </sec>
    <sec id="sec-4">
      <title>Features</title>
      <p>
        We used many of the provided features, including Aesthetic visual
features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the final classification layer of the C3D [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] model, Color
Histogram in HSV space, Histogram of Motion Patterns (HMP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and
the outputs of the fc7 layer of the InceptionV3 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] deep neural
network. We also employed two additional features:
      </p>
      <p>
        Image Memorability Prediction. We extracted three frames from
every video, at the time stamps 0.5, 3.0, and 5.5 seconds. For 7-second
videos, this gives good coverage of the entire video, even
when scenes change rapidly. Then, we used the MemNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
image memorability prediction model to extract image memorability
scores of the three frames. Finally, the three prediction scores were
averaged as a memorability score prediction for the entire video.
      </p>
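      <p>The frame sampling and score averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 30 fps frame rate, the function names, and the per-frame scores are assumptions, and the scores merely stand in for the output of an image memorability model such as MemNet.</p>
      <preformat><![CDATA[
```python
from statistics import mean

# Time stamps (in seconds) at which frames are sampled, as stated in the text.
FRAME_TIMES_S = (0.5, 3.0, 5.5)


def frame_indices(fps, times=FRAME_TIMES_S):
    """Convert sampling time stamps to frame indices for a given frame rate."""
    return [int(round(t * fps)) for t in times]


def video_memorability(frame_scores):
    """Average per-frame image memorability scores into one video-level score."""
    if not frame_scores:
        raise ValueError("at least one frame score is required")
    return mean(frame_scores)


# Hypothetical example: a 30 fps clip and made-up per-frame scores that stand
# in for the output of an image memorability model.
indices = frame_indices(fps=30)
score = video_memorability([0.6, 0.9, 0.75])
```
]]></preformat>
      <p>Averaging reduces the three frame-level predictions to a single scalar per video, which is why this feature enters the fused network as a very low-dimensional input.</p>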
      <p>
        Caption. Following prior work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we considered utilizing
the caption data provided in the dataset. Given the textual metadata of each
video, we generated a feature vector using Google’s Word2Vec [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
model. This yields a 300-dimensional vector for each word within
the provided video caption. Then, the word vectors of each video were
averaged to create one feature vector per video.
        <fn><p>∗A. Weiss and B. Sang contributed equally to this work.</p></fn>
      </p>
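      <p>The caption feature amounts to averaging word vectors, which can be sketched as below. The sketch makes several assumptions that the text does not state: captions are split on whitespace and lower-cased, words without an embedding are skipped, and a caption with no known word maps to a zero vector. The toy 2-dimensional table is hypothetical and only stands in for the 300-dimensional Word2Vec vectors.</p>
      <preformat><![CDATA[
```python
def caption_vector(caption, embeddings, dim):
    """Average the word vectors of a caption into one fixed-length vector.

    Words missing from `embeddings` are skipped (an assumption; the text does
    not say how out-of-vocabulary words were handled).
    """
    vectors = [embeddings[w] for w in caption.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim  # assumed fallback for captions with no known word
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional table standing in for 300-dimensional Word2Vec vectors.
toy_embeddings = {"dog": [1.0, 0.0], "runs": [0.0, 1.0]}
feature = caption_vector("Dog runs fast", toy_embeddings, dim=2)
```
]]></preformat>
      <p>Because the average has the same length regardless of caption length, every video yields a vector of identical size, which is what a fully connected input layer requires.</p>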
      <p>Feature Fusion via Concatenation. Given the described features, the key task is to find the
combination or subset of features that correlates best with the video
memorability score. To this end, we tried a deep neural network
that stacks multiple fully connected layers with modern regularization
techniques. Fig. 1 depicts our network structure.</p>
      <p>
        Our network design focused on two aspects: (a) including a
sufficient number of layers for high-dimensional input features,
so that subtle but important variations are not ignored, and (b)
treating all features equally, so that important but low-dimensional
features are not overwhelmed by the larger ones.
Each linear layer is followed by a ReLU [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] activation function
and dropout regularization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We used 0.5 for all dropout rates.
The network hyperparameters were determined by preliminary
experiments, and Table 2 summarizes the results using each feature
individually. Several methods have been proposed for feature
fusion in deep neural networks, particularly for convolutional neural
nets [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. After some preliminary trials, we decided to use
simple concatenation, as no significant difference was found among the alternatives.
      </p>
      <p>[Fig. 1, additional labels that survive extraction: Color Histogram branch with layers (3x3x256) → (3x256) and (3x256) → 256, each followed by Dropout; HMP branch with Dropout layers.]</p>
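      <p>The early-fusion design described above (one stack of fully connected layers per feature, each layer followed by ReLU and dropout at rate 0.5, then concatenation) can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the layer sizes, feature names, weight initialization, single linear output unit, and the use of inverted dropout are all hypothetical choices.</p>
      <preformat><![CDATA[
```python
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, training, rng):
    """Inverted dropout: active only in training, survivors rescaled by 1/(1-rate)."""
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer mapping n_in to n_out units."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_branches(sizes, rng):
    """Create one stack of layers per feature from a list of layer widths."""
    return {
        name: [init_layer(a, b, rng) for a, b in zip(dims, dims[1:])]
        for name, dims in sizes.items()
    }


def fused_forward(features, branches, head, training=False, rng=None, rate=0.5):
    """Run each feature through its own branch, concatenate, then regress a score."""
    rng = rng or random.Random(0)
    fused = []
    for name in sorted(branches):  # fixed order keeps the concatenation stable
        x = features[name]
        for weights, bias in branches[name]:
            x = dropout(relu(linear(x, weights, bias)), rate, training, rng)
        fused.extend(x)
    weights, bias = head
    return linear(fused, weights, bias)[0]


# Hypothetical sizes: a 12-dimensional feature gets two layers, a scalar
# feature gets one, and both branches end with the same width (4).
rng = random.Random(0)
branches = build_branches({"high_dim": [12, 6, 4], "scalar": [1, 4]}, rng)
head = init_layer(8, 1, rng)
features = {"high_dim": [rng.random() for _ in range(12)], "scalar": [0.8]}
score = fused_forward(features, branches, head)
```
]]></preformat>
      <p>Giving the high-dimensional branch more layers while ending every branch at the same width is one way to reflect aspects (a) and (b): large inputs are reduced gradually, and each feature contributes equally many units to the concatenated vector.</p>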
    </sec>
    <sec id="sec-5">
      <title>Pre-training Layers</title>
      <p>
        One well-known issue in training deep neural networks
is the vanishing gradient problem [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Although we used ReLU to
alleviate the problem, we found that our network in Fig. 1 easily gets
stuck during training. To speed up training, we borrowed an
idea from transfer learning and pre-trained the lower layers before
the concatenation. We denote the network without pre-training as
model A and the one with pre-training as model B. As evident from
      </p>
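      <p>One possible reading of the pre-training step is sketched below. The text does not say how the lower layers were pre-trained, so everything here is an assumption: each branch is trained alone with a temporary linear head on the memorability targets using plain SGD on squared error, the head is then discarded, and the branch weights become the starting point of the fused network. The function names, toy data, and hyperparameters are hypothetical.</p>
      <preformat><![CDATA[
```python
import random


def pretrain_branch(xs, ys, hidden, epochs=200, lr=0.05, seed=0):
    """Train one branch (linear + ReLU) with a temporary linear head by SGD.

    Minimizes squared error against the targets, then returns only the
    branch parameters; the temporary head is discarded.
    """
    rng = random.Random(seed)
    n_in = len(xs[0])
    W = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b = [0.0] * hidden
    v = [rng.gauss(0.0, 0.5) for _ in range(hidden)]  # temporary head weights
    c = 0.0                                           # temporary head bias
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
            h = [max(0.0, zj) for zj in z]
            err = sum(vj * hj for vj, hj in zip(v, h)) + c - y
            for j in range(hidden):
                # Gradient reaching unit j, computed before v[j] is updated.
                g = err * v[j] if z[j] > 0.0 else 0.0
                v[j] -= lr * err * h[j]
                b[j] -= lr * g
                for i in range(n_in):
                    W[j][i] -= lr * g * x[i]
            c -= lr * err
    return W, b


def init_fused_branches(pretrained):
    """Copy pre-trained branch parameters as the starting point of the fused model."""
    return {name: ([row[:] for row in W], b[:]) for name, (W, b) in pretrained.items()}


# Hypothetical toy data: two feature types for four videos with made-up scores.
targets = [0.9, 0.7, 0.8, 0.6]
toy_features = {
    "feat_a": [[0.2, 0.1, 0.4], [0.9, 0.3, 0.5], [0.4, 0.8, 0.1], [0.7, 0.6, 0.9]],
    "feat_b": [[0.5], [0.1], [0.3], [0.8]],
}
pretrained = {}
for name, xs in toy_features.items():
    pretrained[name] = pretrain_branch(xs, targets, hidden=4)
fused_init = init_fused_branches(pretrained)
```
]]></preformat>
      <p>In this reading, the fused network would then be trained end to end from <monospace>fused_init</monospace> instead of from random weights, so that each branch already produces a useful representation before concatenation.</p>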
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>
        Overall results of our submissions are summarized in Table 1. We
used ADAM for the optimization [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and, in most cases,
the default learning rate of 0.001. Due to schedule constraints,
we did not include the Caption feature in the submitted methods A and
B. We report the cross-validation result that includes the Caption feature
with the best-performing configuration (with pre-trained layers) as
method C. The results show that the pre-training approach
B achieved more balanced generalization performance, regardless of
the dev-set/test-set split. Moreover, B shows consistent performance
improvement as the number of training epochs increases, indicating that the
model is being trained in the right direction.
      </p>
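      <p>For reference, a single ADAM update with the learning rate of 0.001 mentioned above can be written as follows. The remaining defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are those suggested in the ADAM paper; whether the submitted runs used exactly these values is an assumption, and the scalar toy objective is purely illustrative.</p>
      <preformat><![CDATA[
```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Hypothetical example: minimize f(theta) = theta ** 2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```
]]></preformat>
      <p>Because the update is normalized by the running gradient magnitude, each early step moves the parameter by roughly the learning rate, independent of the raw gradient scale.</p>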
      <p>
        On the downside, several challenges were identified. First, the
performance of the feature-fused network did not improve much
over that of individual features. Only when high-level information, e.g.
captions, was involved did we reach the baseline performance. As reported
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], high-level pre-processing appears essential to achieve
reasonable performance. One may consider late fusion instead of
early fusion for some of the features we considered, e.g. MemNet.
Second, long-term video memorability is more difficult to predict
than short-term memorability. From our experiments, it was unclear
which feature would improve long-term video memorability
prediction, as all of them yielded poor performance. Even the high-level
semantic features struggled in this case. This is not surprising,
given that the true long-term memorability scores are 1 (memorable for
all annotators) in many cases. A more robust prediction model that
can distinguish subtle differences might be needed for this subtask.
      </p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported in part by The College of New Jersey
under the Support Of Scholarly Activity (SOSA) 2017-2019 grant. The
authors acknowledge use of the ELSA high performance computing
cluster at The College of New Jersey for conducting the research
reported in this paper. This cluster is funded by the National Science
Foundation under grant number OAC-1828163.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>J.</given-names> <surname>Almeida</surname></string-name>,
          <string-name><given-names>N. J.</given-names> <surname>Leite</surname></string-name>, and
          <string-name><given-names>R. da S.</given-names> <surname>Torres</surname></string-name>.
          <year>2011</year>.
          <article-title>Comparison of video sequences with histograms of motion patterns</article-title>.
          In <source>2011 18th IEEE International Conference on Image Processing</source>.
          <fpage>3673</fpage>-<lpage>3676</lpage>.
          https://doi.org/10.1109/ICIP.2011.6116516
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>E.</given-names> <surname>Boyaci</surname></string-name> and
          <string-name><given-names>M.</given-names> <surname>Sert</surname></string-name>.
          <year>2017</year>.
          <article-title>Feature-level fusion of deep convolutional neural networks for sketch recognition on smartphones</article-title>.
          In <source>2017 IEEE International Conference on Consumer Electronics (ICCE)</source>.
          <fpage>466</fpage>-<lpage>467</lpage>.
          https://doi.org/10.1109/ICCE.2017.7889398
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Shi-Qi</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Rong-Hui</given-names> <surname>Zhan</surname></string-name>,
          <string-name><given-names>Jie-Min</given-names> <surname>Hu</surname></string-name>, and
          <string-name><given-names>Jun</given-names> <surname>Zhang</surname></string-name>.
          <year>2017</year>.
          <article-title>Feature Fusion Based on Convolutional Neural Network for SAR ATR</article-title>.
          <source>ITM Web Conf.</source>
          <volume>12</volume> (2017), 05001.
          https://doi.org/10.1051/itmconf/20171205001
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Romain</given-names> <surname>Cohendet</surname></string-name>,
          <string-name><given-names>Claire-Hélène</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>Ngoc Q. K.</given-names> <surname>Duong</surname></string-name>,
          <string-name><given-names>Mats</given-names> <surname>Sjöberg</surname></string-name>,
          <string-name><given-names>Bogdan</given-names> <surname>Ionescu</surname></string-name>, and
          <string-name><given-names>Thanh-Toan</given-names> <surname>Do</surname></string-name>.
          <article-title>MediaEval 2018: Predicting Media Memorability</article-title>.
          In <source>Proc. of MediaEval 2018 Workshop</source>, Sophia Antipolis, France, Oct. 29-31,
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Romain</given-names> <surname>Cohendet</surname></string-name>,
          <string-name><given-names>Karthik</given-names> <surname>Yadati</surname></string-name>,
          <string-name><given-names>Ngoc Q. K.</given-names> <surname>Duong</surname></string-name>, and
          <string-name><given-names>Claire-Hélène</given-names> <surname>Demarty</surname></string-name>.
          <year>2018</year>.
          <article-title>Annotating, understanding, and predicting long-term video memorability</article-title>.
          In <source>Proc. of the ICMR 2018 Workshop</source>, Yokohama, Japan, June 11-14.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Claire-Hélène</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>Mats</given-names> <surname>Sjöberg</surname></string-name>,
          <string-name><given-names>Bogdan</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>Thanh-Toan</given-names> <surname>Do</surname></string-name>,
          <string-name><given-names>Michael</given-names> <surname>Gygli</surname></string-name>, and
          <string-name><given-names>Ngoc Q. K.</given-names> <surname>Duong</surname></string-name>.
          <article-title>Predicting Media Interestingness Task at MediaEval 2017</article-title>.
          In <source>Proc. of MediaEval 2017 Workshop</source>, Dublin, Ireland, Sept. 13-15,
          <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
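          [7]
          <string-name><given-names>Andreas F</given-names> <surname>Haas</surname></string-name>,
          <string-name><given-names>Marine</given-names> <surname>Guibert</surname></string-name>,
          <string-name><given-names>Anja</given-names> <surname>Foerschner</surname></string-name>,
          <string-name><given-names>Sandi</given-names> <surname>Calhoun</surname></string-name>,
          <string-name><given-names>Emma</given-names> <surname>George</surname></string-name>,
          <string-name><given-names>Mark</given-names> <surname>Hatay</surname></string-name>,
          <string-name><given-names>Elizabeth</given-names> <surname>Dinsdale</surname></string-name>,
          <string-name><given-names>Stuart A</given-names> <surname>Sandin</surname></string-name>,
          <string-name><given-names>Jennifer E</given-names> <surname>Smith</surname></string-name>,
          <string-name><given-names>Mark JA</given-names> <surname>Vermeij</surname></string-name>, and others.
          <year>2015</year>.
          <article-title>Can we measure beauty? Computational evaluation of coral reef aesthetics</article-title>.
          <source>PeerJ</source>
          <volume>3</volume>, e1390 (2015).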
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Geoffrey E.</given-names> <surname>Hinton</surname></string-name>,
          <string-name><given-names>Nitish</given-names> <surname>Srivastava</surname></string-name>,
          <string-name><given-names>Alex</given-names> <surname>Krizhevsky</surname></string-name>,
          <string-name><given-names>Ilya</given-names> <surname>Sutskever</surname></string-name>, and
          <string-name><given-names>Ruslan</given-names> <surname>Salakhutdinov</surname></string-name>.
          <year>2012</year>.
          <article-title>Improving neural networks by preventing co-adaptation of feature detectors</article-title>.
          <source>CoRR</source> abs/1207.0580 (2012).
          arXiv:1207.0580 http://arxiv.org/abs/1207.0580
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          .
          <year>1991</year>
          .
          <article-title>Untersuchungen zu dynamischen neuronalen Netzen</article-title>
          . Diploma,
          <source>Technische Universität München</source>
          <volume>91</volume>
          ,
          <issue>1</issue>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Aditya</given-names> <surname>Khosla</surname></string-name>,
          <string-name><given-names>Akhil S.</given-names> <surname>Raju</surname></string-name>,
          <string-name><given-names>Antonio</given-names> <surname>Torralba</surname></string-name>, and
          <string-name><given-names>Aude</given-names> <surname>Oliva</surname></string-name>.
          <year>2015</year>.
          <article-title>Understanding and Predicting Image Memorability at a Large Scale</article-title>.
          In <source>International Conference on Computer Vision (ICCV)</source>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Diederik P.</given-names> <surname>Kingma</surname></string-name> and
          <string-name><given-names>Jimmy</given-names> <surname>Ba</surname></string-name>.
          <year>2014</year>.
          <article-title>Adam: A Method for Stochastic Optimization</article-title>.
          <source>CoRR</source> abs/1412.6980 (2014).
          arXiv:1412.6980 http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Tomas</given-names> <surname>Mikolov</surname></string-name>,
          <string-name><given-names>Kai</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Greg</given-names> <surname>Corrado</surname></string-name>, and
          <string-name><given-names>Jeffrey</given-names> <surname>Dean</surname></string-name>.
          <year>2013</year>.
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>.
          <source>CoRR</source> abs/1301.3781 (2013).
          arXiv:1301.3781 http://arxiv.org/abs/1301.3781
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Vinod</given-names> <surname>Nair</surname></string-name> and
          <string-name><given-names>Geoffrey E.</given-names> <surname>Hinton</surname></string-name>.
          <year>2010</year>.
          <article-title>Rectified Linear Units Improve Restricted Boltzmann Machines</article-title>.
          In <source>Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML'10)</source>.
          Omnipress, USA,
          <fpage>807</fpage>-<lpage>814</lpage>.
          http://dl.acm.org/citation.cfm?id=3104322.3104425
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Vanhoucke</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Ioffe</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Shlens</surname></string-name>, and
          <string-name><given-names>Z.</given-names> <surname>Wojna</surname></string-name>.
          <year>2016</year>.
          <article-title>Rethinking the Inception Architecture for Computer Vision</article-title>.
          In <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>.
          <fpage>2818</fpage>-<lpage>2826</lpage>.
          https://doi.org/10.1109/CVPR.2016.308
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>Du</given-names> <surname>Tran</surname></string-name>,
          <string-name><given-names>Lubomir</given-names> <surname>Bourdev</surname></string-name>,
          <string-name><given-names>Rob</given-names> <surname>Fergus</surname></string-name>,
          <string-name><given-names>Lorenzo</given-names> <surname>Torresani</surname></string-name>, and
          <string-name><given-names>Manohar</given-names> <surname>Paluri</surname></string-name>.
          <year>2015</year>.
          <article-title>Learning Spatiotemporal Features with 3D Convolutional Networks</article-title>.
          In <source>Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV '15)</source>.
          IEEE Computer Society, Washington, DC, USA,
          <fpage>4489</fpage>-<lpage>4497</lpage>.
          https://doi.org/10.1109/ICCV.2015.510
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>