<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Video Memorability Prediction with Recurrent Neural Networks and Video Titles at the 2018 MediaEval Predicting Media Memorability Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wensheng Sun</string-name>
          <email>wsun3@mtu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xu Zhang</string-name>
          <email>xzhang21@svsu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Michigan Technological University</institution>
          ,
          <addr-line>Houghton</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saginaw Valley State University</institution>
          ,
          <addr-line>Saginaw</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>This paper describes the approach developed to predict short-term and long-term video memorability at the 2018 MediaEval Predicting Media Memorability Task [1]. The approach uses scene semantics derived from the titles of the videos through natural language processing (NLP) techniques and a recurrent neural network (RNN). Compared to video-based features, this approach has a low computational cost for feature extraction. The performance of the semantic-based methods is compared with that of aesthetic feature-based methods using support vector regression (ϵ-SVR) and artificial neural network (ANN) models, and the possibility of predicting the highly subjective media memorability with simple features is explored.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Knowledge of the memorability of a video has potential in
advertising and content recommendation applications. Although highly
subjective, media memorability has been shown to be
measurable and predictable. As with most other machine learning problems,
finding the most relevant features and the right model is the key
to successfully predicting media memorability. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the
authors investigate possible features that are correlated with
image memorability. It is shown that simple image features such as
color and number of objects show negligible correlation with image
memorability, whereas semantics are significantly correlated with
the memorability.
      </p>
      <p>
        Even though images are reportedly different from videos in many
aspects [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the similarity and connection between images and
videos motivate this work to explore the possible connection
between semantics of a video and its memorability at the 2018
MediaEval Predicting Media Memorability Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This hypothesis is
confirmed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where the authors show that visual semantic
features provide the best prediction among the audio and visual features considered.
Unlike [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], this work uses an RNN to extract the semantics from
the video titles and to predict video memorability.
      </p>
      <p>
        Compared to computing video-based features, extracting a video's
semantics from its title has a relatively low computational cost.
Moreover, the authors in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] demonstrate a strong connection
between aesthetic features and image interestingness. Thus in this
work, models to predict video memorability using precomputed
aesthetic features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] provided by the task organizers are also developed,
and their performance is compared with that of the semantic-based models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Semantic-based Models</title>
      <p>The main model, corresponding to run 4, is a three-layer
neural network with a recurrent layer; its structure is
depicted in Fig. 1. After the titles are imported, punctuation and
whitespace are removed. The texts are then tokenized into integer sequences
of length 20: longer titles are truncated and shorter titles
are padded with zeros. After preprocessing, 80% of the training
dataset is randomly chosen to train the model, and the remaining
20% is used for model evaluation.</p>
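      <p>The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the lowercasing, the replacement of punctuation by spaces, padding and truncating at the end of the sequence, and the fixed random seed are all assumptions, since the text does not specify them.</p>
      <preformat>
```python
import random
import string

MAX_LEN = 20  # sequence length stated in the text


def tokenize_titles(titles, max_len=MAX_LEN):
    """Map each title to a fixed-length list of integer word indices."""
    vocab = {}  # word -> index; index 0 is reserved for padding
    sequences = []
    # Assumption: punctuation is replaced by spaces before splitting.
    punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    for title in titles:
        words = title.lower().translate(punct_to_space).split()
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in words]
        ids = ids[:max_len]                     # truncate longer titles
        ids = ids + [0] * (max_len - len(ids))  # pad shorter titles with zeros
        sequences.append(ids)
    return sequences, vocab


def split_train_validation(items, train_fraction=0.8, seed=0):
    """Randomly split items into training and validation subsets."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(items)))
    train = [items[i] for i in indices[:cut]]
    valid = [items[i] for i in indices[cut:]]
    return train, valid


# Hypothetical titles, used only to exercise the functions.
titles = ["Two dogs playing in the snow!", "Sunset over the sea"]
seqs, vocab = tokenize_titles(titles)
train, valid = split_train_validation(list(range(10)))
```
      </preformat>
      <p>Each title becomes a list of 20 integers, with zeros filling the positions after the last word, and the split assigns 8 of the 10 example items to training and 2 to validation.</p>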
      <p>The tokenized titles are fed to an embedding layer with an
output dimension of 15. The embedding matrix is initialized
from a uniform distribution, and no embedding regularizer is used.
The semantics are extracted by a fully connected recurrent
layer with 10 units placed after the embedding layer. The activation
function of the recurrent layer is the hyperbolic tangent. The layer uses
a bias vector, which is initialized to zeros. The initializer for the
kernel weight matrix, used for the linear transformation of the inputs,
is “glorot uniform”. The initializer for the recurrent kernel
weight matrix, used for the linear transformation of the recurrent
state, is “orthogonal”. A 10-node fully connected dense layer
follows, using the rectified linear unit (ReLU) activation function.
Its kernel is regularized with l1-l2 regularization with
λ1 = 0.001 and λ2 = 0.004. Its initialization scheme is the same
as that of the RNN layer. The last layer is a 2-node dense layer
with a linear activation function that predicts the short-term and
long-term memorability simultaneously. The model is trained
using the RMSprop optimizer against the mean absolute error (MAE)
for 10 epochs with a batch size of 20.</p>
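      <p>To make the data flow through these layers concrete, the following is a minimal forward-pass sketch in plain Python, assuming the layer sizes given above. It is not the authors' implementation: the weights are small random placeholders rather than the initializers named in the text, the padding index 0 is treated as an ordinary token, the vocabulary size of 50 is hypothetical, and training with RMSprop and the regularization terms are omitted.</p>
      <preformat>
```python
import math
import random

EMBED_DIM, RNN_UNITS, DENSE_UNITS, OUTPUTS = 15, 10, 10, 2


def random_matrix(rows, cols, rng, scale=0.1):
    """Placeholder weights; the paper's initializers are not reproduced."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def init_params(vocab_size, seed=0):
    rng = random.Random(seed)
    return {
        "embedding": random_matrix(vocab_size, EMBED_DIM, rng),
        "w_in": random_matrix(RNN_UNITS, EMBED_DIM, rng),
        "w_rec": random_matrix(RNN_UNITS, RNN_UNITS, rng),
        "b_rnn": [0.0] * RNN_UNITS,
        "w_dense": random_matrix(DENSE_UNITS, RNN_UNITS, rng),
        "b_dense": [0.0] * DENSE_UNITS,
        "w_out": random_matrix(OUTPUTS, DENSE_UNITS, rng),
        "b_out": [0.0] * OUTPUTS,
    }


def forward(token_ids, params):
    """Embedding, then simple RNN (tanh), dense (ReLU), dense (linear)."""
    h = [0.0] * RNN_UNITS  # initial recurrent state
    for token in token_ids:
        x = params["embedding"][token]   # look up the 15-dimensional word vector
        wx = matvec(params["w_in"], x)   # linear transformation of the input
        uh = matvec(params["w_rec"], h)  # linear transformation of the state
        h = [math.tanh(a + b + c) for a, b, c in zip(wx, uh, params["b_rnn"])]
    dense_in = matvec(params["w_dense"], h)
    hidden = [max(0.0, z + b) for z, b in zip(dense_in, params["b_dense"])]
    out = matvec(params["w_out"], hidden)
    return [z + b for z, b in zip(out, params["b_out"])]


def mean_absolute_error(predicted, target):
    """MAE over the two outputs (short-term and long-term scores)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


params = init_params(vocab_size=50)             # hypothetical vocabulary size
scores = forward([3, 7, 1] + [0] * 17, params)  # one padded 20-token title
```
      </preformat>
      <p>With these sizes the recurrent layer holds 10 × (15 + 10 + 1) = 260 parameters, the hidden dense layer 10 × 10 + 10 = 110, and the output layer 10 × 2 + 2 = 22; the embedding layer adds 15 parameters per vocabulary entry.</p>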
      <p>
        Similar to the model in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the semantics are combined with a
support vector regression (ϵ-SVR) model to generate run 5, whose
structure is also shown in Fig. 1. After the preprocessing stage, the
dimensionality of the tokenized titles is reduced through principal component analysis (PCA),
keeping enough components to explain 90% of the variance. The
output is then fed into an ϵ-SVR model. The penalty parameter C of
the error term is set to 0.1. The parameter ϵ, which defines a tube within which no
penalty is incurred, is set to 0.01. A radial basis function is used
as the kernel. These hyperparameters are obtained
through a cross-validated grid search using Spearman’s rank
correlation as the scoring metric.
      </p>
      <p>[Fig. 1: Semantic-based models. Titles pass through preprocessing (punctuation removal, vectorization). Run 4: embedding layer, RNN layer (hyperbolic tangent, 10 units), dense layer (ReLU, 10 nodes), dense layer (linear, 2 nodes). Run 5: PCA (90% of variance), ϵ-SVR.]</p>
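      <p>Two quantities in the ϵ-SVR description can be illustrated with a short plain-Python sketch: the ϵ-insensitive loss, which ignores errors that fall inside the tube, and Spearman's rank correlation, used here as the scoring metric. The function names are hypothetical, the grid search itself is not reproduced, and the correlation is undefined (division by zero) if either input is constant.</p>
      <preformat>
```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.01):
    """Mean loss that is zero for errors inside the epsilon tube."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```
      </preformat>
      <p>Because the metric depends only on the ordering of the predictions, a model can score well even when its predicted values are offset or rescaled relative to the ground truth.</p>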
    </sec>
    <sec id="sec-4">
      <title>Aesthetic Feature-based Models</title>
      <p>
        Details of the models using precomputed aesthetic features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are
described in this section. As shown in Fig. 2, run 1 and run 2 are
generated by ϵ-SVR models using aesthetic visual features
aggregated at video level by median and mean methods, respectively.
In both runs, the input features are first standardized, and a PCA
module then reduces the dimensionality of the data while accounting for
95% of the data variance. A radial basis function kernel is used in both
runs. The best parameters for the ϵ-SVR model found by cross-validated grid search
are C = 0.01 and ϵ = 0.1.
      </p>
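      <p>The feature preparation for runs 1 and 2 can be sketched as follows in plain Python. This is an illustration under assumptions rather than the authors' pipeline: it assumes per-frame feature vectors as the input to the aggregation, uses the population standard deviation for standardization, and takes the per-component explained variances as given instead of computing a PCA. All function names and the example numbers are hypothetical.</p>
      <preformat>
```python
import statistics


def aggregate_frames(frame_features, method="mean"):
    """Collapse per-frame feature vectors into one video-level vector."""
    reducer = statistics.mean if method == "mean" else statistics.median
    return [reducer(column) for column in zip(*frame_features)]


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # A constant column has zero deviation; divide by 1.0 instead.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def components_for_variance(explained_variances, target=0.95):
    """Smallest number of leading components reaching the target ratio."""
    total = sum(explained_variances)
    running = 0.0
    ordered = sorted(explained_variances, reverse=True)
    for count, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical per-frame features for one video (3 frames, 2 features).
frames = [[1.0, 4.0], [3.0, 6.0], [8.0, 5.0]]
video_mean = aggregate_frames(frames, "mean")      # [4.0, 5.0]
video_median = aggregate_frames(frames, "median")  # [3.0, 5.0]
```
      </preformat>
      <p>The example shows how the two aggregation methods can differ: the outlying first-feature value in the third frame pulls the mean to 4.0 while the median stays at 3.0.</p>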
      <p>The evaluation results show that the mean aesthetic features are
more relevant to video memorability than the median ones. Thus run 3 is generated
using an ANN and the mean aesthetic features, as illustrated in Fig. 2. The
ANN model consists of three dense layers. The first two are
fully connected layers with 50 nodes each, ReLU activation,
and l2 regularization with a penalty constant of 0.001. The dropout rates for the first two
layers are 0.1 and 0.5, respectively. The output layer has
two nodes and uses a linear activation function. Mean squared error
(MSE) is used as the loss function during training, where
the validation data is randomly chosen from the training data within
each epoch. The model is trained for 20 epochs with a batch size of 32.</p>
      <p>From the returned evaluation results in Table 1, the following
conclusions can be drawn: 1) The model using an RNN and
semantics is the best of the five models. This confirms that the
semantics of the videos are more relevant to both short-term and
long-term memorability than the aesthetic features. For long-term
memorability in particular, the semantic-based models consistently outperform the aesthetic
feature-based models. 2) Without the recurrent layer,
the performance decreases. It can thus be inferred that the
interaction between objects in a video has more impact on the video’s
long-term and short-term memorability than the objects alone.
3) Even though there is some correlation between short-term and
long-term memorability, as depicted in Fig. 3, the results show
that short-term memorability is more predictable than long-term
memorability, since all models score higher on short-term than on long-term
memorability. As illustrated in Fig. 3, long-term scores range from
0.2 to 1 and exhibit higher variance than the short-term scores,
which range from 0.4 to 1. One possible reason is that
long-term memorability is more subjective and depends more on an
individual’s memory.</p>
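      <p>For the run 3 network described earlier in this section, the training-time ingredients (MSE loss, l2 weight penalty, and dropout) can be sketched in plain Python. This is a hedged illustration only: the text does not state the exact form of the penalty or of dropout, so the sketch assumes the penalty is the regularization constant times the sum of squared weights and that dropout rescales the surviving units (inverted dropout). The function names are hypothetical.</p>
      <preformat>
```python
import random


def mean_squared_error(predicted, target):
    """MSE averaged over the two outputs (short-term and long-term scores)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def l2_penalty(weights, strength=0.001):
    """Assumed form: strength times the sum of squared weights."""
    return strength * sum(w * w for row in weights for w in row)


def dropout(activations, rate, rng):
    """Assumed inverted dropout: zero units with probability `rate`
    and divide the kept units by (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
layer1 = dropout([1.0] * 50, 0.1, rng)  # first hidden layer, rate 0.1
layer2 = dropout([1.0] * 50, 0.5, rng)  # second hidden layer, rate 0.5
```
      </preformat>
      <p>During training, the regularized objective would then be the MSE plus the l2 penalties of the two hidden layers; at prediction time dropout is switched off.</p>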
      <p>The SVR models using median and mean
aesthetic features perform close to run 4 in
short-term memorability prediction. However, their long-term performance
is far worse than that of run 4. Further investigation is needed to
clarify this. The performance of run 3 is worse than that of run 2, even
though both use mean aesthetic features. Possible reasons
are over-fitting and the missing standardization procedure in run 3.
In the future, ensemble methods are expected to further enhance
the prediction accuracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Romain</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Thanh-Toan</given-names>
            <surname>Do</surname>
          </string-name>
          .
          <article-title>MediaEval 2018: Predicting Media Memorability Task</article-title>
          .
          <source>In Proc. of the MediaEval 2018 Workshop</source>
          , Vol. abs/1807.01052. 29-31 October 2018, Sophia Antipolis, France,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Phillip</given-names>
            <surname>Isola</surname>
          </string-name>
          , Jianxiong Xiao, Devi Parikh, Antonio Torralba, and
          <string-name>
            <given-names>Aude</given-names>
            <surname>Oliva</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>What makes a photograph memorable?</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>36</volume>
          ,
          <issue>7</issue>
          (
          <year>2014</year>
          ),
          <fpage>1469</fpage>
          -
          <lpage>1482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Singal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kedia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Shetty</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Show and Recall: Learning What Makes Videos Memorable</article-title>
          .
          <source>In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW)</source>
          .
          <fpage>2730</fpage>
          -
          <lpage>2739</lpage>
          . https://doi.org/10.1109/ICCVW.2017.321
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Romain</given-names>
            <surname>Cohendet</surname>
          </string-name>
          , Karthik Yadati,
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Annotating, understanding, and predicting long-term video memorability</article-title>
          .
          <source>In Proc. of the ICMR 2018 Workshop</source>
          , Yokohama, Japan, June 11-14.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gygli</surname>
          </string-name>
          , Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool.
          <year>2013</year>
          .
          <article-title>The interestingness of images</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          .
          <fpage>1633</fpage>
          -
          <lpage>1640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Andreas F</given-names>
            <surname>Haas</surname>
          </string-name>
          , Marine Guibert, Anja Foerschner, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A Sandin, Jennifer E Smith, Mark JA Vermeij, and others.
          <year>2015</year>
          .
          <article-title>Can we measure beauty? Computational evaluation of coral reef aesthetics</article-title>
          .
          <source>PeerJ</source>
          <volume>3</volume>
          (
          <year>2015</year>
          ), pp.
          <fpage>1390</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>