<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GIBIS at MediaEval 2018: Predicting Media Memorability Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ricardo Manhães Savii</string-name>
          <email>ricardo.savii@dafiti.com.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuel Felipe dos Santos</string-name>
          <email>felipe.samuel@unifesp.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jurandy Almeida</string-name>
          <email>jurandy.almeida@unifesp.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dafiti Group</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GIBIS Lab, Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Media Memorability</institution>
          ,
          <addr-line>k-NN Regressor, Deep Learning</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>This paper describes the GIBIS team experience in the Predicting Media Memorability Task at MediaEval 2018. In this task, we were required to develop an approach to predict a score reflecting whether videos are memorable or not, considering short-term memorability and long-term memorability. Our proposal relies on different learning strategies: for long-term memorability, we adopted k-NN regressors trained on hand-crafted motion features; and for short-term memorability, we trained deep learning models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The Predicting Media Memorability task is part of the MediaEval
2018 Benchmarking Initiative for Multimedia Evaluation. The goal
of this task is to automatically predict a memorability score for
a video, reflecting its probability of being remembered. For this purpose,
a dataset of 10,000 short, soundless videos is provided, split
into 8,000 videos for the development set and 2,000 videos for the
test set. Pre-computed visual features are also provided to facilitate
participation. For more details about this task, please refer to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In this paper, we explore two main approaches: (1) for long-term
memorability, an ensemble of ten KNR (k-Nearest Neighbor
Regressor) or SVR (Support Vector Regression) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] trained on the provided
HMP (Histogram of Motion Patterns) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] feature; and (2) for
short-term memorability, a deep learning model based on 3D convolutions
and 3D pooling layers, known as C3D (Convolution3D) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>OUR APPROACH</title>
      <p>
        The proposed approach exploits different strategies for long-term
and short-term memorability. The former relies on hand-crafted
motion features extracted with HMP, whereas the latter uses
data-driven features learned with C3D. One limitation of C3D is its
restricted capacity to capture subtle but long-term motion dynamics, as it
requires breaking a video into small clips. Unlike C3D, HMP captures
the motion dynamics of a video as a whole, and not just parts of it.
      </p>
      <p>
        Our proposed approach for the long-term memorability subtask
consists of using the pre-computed HMP feature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in conjunction
with two regression algorithms: SVR (Support Vector Regression)
and KNR (k-Nearest Neighbor Regressor).
      </p>
      <p>
        HMP encodes an entire video into a single histogram
representing its overall motion dynamics. We can therefore consider the
HMP vector as a hash that identifies each video as a point in a
high-dimensional space. This notion of space is the foundation for the use
of the KNR and SVR algorithms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        For reproducibility, the KNR and SVR implementations used
come from the scikit-learn Python package [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The steps for the
experiment are: split the dev-set at random into ten folds, then train
one regression model on each fold. In this way, we get ten different
models. They are used as an ensemble to predict the memorability
over the HMP features of the test-set. The average output is
taken as the final score, and we used the 95% confidence interval
as the output confidence.
      </p>
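      <p>As an illustration, the fold-based ensemble described above can be sketched with scikit-learn as follows. This is a minimal sketch and not the code used for the submitted run. It assumes that each model is fit only on the samples of its own fold (the literal reading of the text; fitting on the complementary nine folds is the other plausible reading), that k = 20, and that the 95% confidence interval is a normal-approximation interval over the ten model outputs. The function names and the synthetic arrays standing in for HMP features are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor


def train_fold_ensemble(features, scores, n_folds=10, k=20, seed=0):
    """Fit one k-NN regressor on each of n_folds random folds (assumed reading)."""
    splitter = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    models = []
    # KFold yields (rest, fold) index pairs; each model sees only its own fold here.
    for _, fold_idx in splitter.split(features):
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(features[fold_idx], scores[fold_idx])
        models.append(model)
    return models


def ensemble_predict(models, features, z=1.96):
    """Return the mean prediction and the half-width of an assumed normal 95% CI."""
    preds = np.stack([m.predict(features) for m in models])  # (n_models, n_videos)
    mean = preds.mean(axis=0)
    sem = preds.std(axis=0, ddof=1) / np.sqrt(len(models))
    return mean, z * sem


# Synthetic stand-ins for HMP histograms and memorability scores (not real data).
rng = np.random.default_rng(0)
dev_x, dev_y = rng.random((400, 32)), rng.random(400)
test_x = rng.random((50, 32))

models = train_fold_ensemble(dev_x, dev_y)
final_score, confidence = ensemble_predict(models, test_x)
```
]]></preformat>
      <p>In this sketch, a narrow interval means that the ten models agree on a video, whereas a wide one signals disagreement among them.</p>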
    </sec>
    <sec id="sec-3">
      <title>Short-Term Approach</title>
      <p>
        For short-term memorability, we use a deep learning model based
on the C3D architecture [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, our C3D model differs from the original C3D model
in some respects. Here, we include a
multi-headed layer by adding two fully connected layers at the top of
the C3D model. To provide confidence over prediction values, we
implemented a multi-output model with two heads. The heads
are: (1) a regression head (i.e., with sigmoid activation) used to predict
the memorability score; and (2) a classification head predicting the
discretized memorability bucket. The short-term memorability score was
discretized into 10 buckets, which were used as classes for prediction. In this
way, the classification head, using a softmax activation, provides a
confidence value over the responses of the regression head.
      </p>
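      <p>The discretization of the score into classes can be illustrated with a short NumPy sketch. The text does not state how the bucket boundaries were chosen or how the softmax output is turned into a confidence value, so two assumptions are made here: the buckets are ten equal-width bins over [0, 1], and the confidence is the softmax probability of the bucket in which the regression output falls. Both helper functions are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

N_BUCKETS = 10


def score_to_bucket(scores, n_buckets=N_BUCKETS):
    """Map scores in [0, 1] to class labels 0..n_buckets-1 (assumed equal-width bins)."""
    scores = np.asarray(scores, dtype=float)
    buckets = np.floor(scores * n_buckets).astype(int)
    # A score of exactly 1.0 would land in bucket n_buckets; clip it into the last one.
    return np.clip(buckets, 0, n_buckets - 1)


def confidence_from_softmax(class_probs, regression_scores):
    """Read the softmax probability of the bucket that each regression output falls in."""
    class_probs = np.asarray(class_probs, dtype=float)
    buckets = score_to_bucket(regression_scores, class_probs.shape[1])
    return class_probs[np.arange(len(buckets)), buckets]
```
]]></preformat>
      <p>Under these assumptions, a regression output of 0.95 is checked against the probability that the classification head assigns to the last bucket.</p>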
      <p>The implemented C3D model follows a 3D convolution and
3D max pooling architecture<sup>1</sup> and it outputs a fully connected
layer with 2048 neurons. This is the first of three fully connected
layers that feed a multi-head output for regression and classification.
Figure 1 shows the network architecture of our first experiment.</p>
      <p>
        In our second experiment, we used the same C3D model but within
a two-stream network. For this, a second C3D model
receives the optical flow as input [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The outputs of the two C3D
models are concatenated to form the first fully connected layer.
The motivation for this two-stream network is to evaluate whether our
C3D model can improve its results with this extra information. For
reproducibility, we used the dense version of optical flow provided
with the OpenCV library<sup>2</sup>. Figure 2 shows the overall architecture
for this experiment.
        <fn><label>1</label><p>Our C3D implementation is available at: https://github.com/ricoms/deep_memorability/blob/master/deep_memorability/trainer2/video_c3d.py</p></fn>
      </p>
      <p>
        For both networks, the input data are normalized to real values
in the range [−1, 1] and resized to 128 × 171 pixels. Also, the C3D
model limits the input to a frame sequence with a predefined length
(typically, 16 frames) and, for this reason, a sequence of 16
consecutive frames from each video was selected at random and used as
input to the network. Optical flow generates a frame sequence with
one fewer frame and, to simplify the implementation, a final frame filled
with zeros was appended at the end.
      </p>
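      <p>The input preparation described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: frames are taken to be 8-bit images already resized to 128 × 171 (interpreted here as height × width), the scaling to [−1, 1] is assumed to be a linear mapping from [0, 255], and the optical flow is replaced by a placeholder array of 15 two-channel fields. The function names and the synthetic video are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

CLIP_LEN = 16


def normalize_frames(frames):
    """Scale 8-bit pixel values from [0, 255] to [-1, 1] (assumed linear mapping)."""
    return frames.astype(np.float32) / 127.5 - 1.0


def sample_clip(frames, rng, clip_len=CLIP_LEN):
    """Pick clip_len consecutive frames starting at a random position."""
    start = rng.integers(0, len(frames) - clip_len + 1)
    return frames[start:start + clip_len]


def pad_flow(flow):
    """Append one all-zero frame so the flow sequence matches the clip length."""
    zero_frame = np.zeros_like(flow[:1])
    return np.concatenate([flow, zero_frame], axis=0)


rng = np.random.default_rng(0)
# Synthetic 40-frame video standing in for a real, already resized one.
video = rng.integers(0, 256, size=(40, 128, 171, 3), dtype=np.uint8)
clip = normalize_frames(sample_clip(video, rng))

# Placeholder for the 15 flow fields computed between 16 consecutive frames.
flow = np.zeros((CLIP_LEN - 1, 128, 171, 2), dtype=np.float32)
flow = pad_flow(flow)
```
]]></preformat>
      <p>After padding, both streams receive sequences of 16 frames, so the two C3D models can share the same input length.</p>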
      <p>For training, a different loss function was used for each head:
mean squared logarithmic error for the regression head and
categorical cross-entropy for the classification head. Then, a weighted sum
of these individual losses with weights 1.0 and 0.7, respectively, was
computed as the final loss to be minimized by an RMSProp optimizer
with a learning rate of 0.0015.</p>
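      <p>To make the combined objective concrete, the weighted sum of the two head losses can be written out in NumPy. This sketch only evaluates the loss on given predictions and is not the training code; it assumes the usual log(1 + x) form of the mean squared logarithmic error and one-hot class labels. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np


def msle(y_true, y_pred):
    """Mean squared logarithmic error, using log(1 + x) on both arguments."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)


def categorical_crossentropy(onehot_true, probs, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted class probabilities."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return np.mean(-np.sum(np.asarray(onehot_true) * np.log(probs), axis=1))


def total_loss(y_true, y_pred, onehot_true, probs, w_reg=1.0, w_cls=0.7):
    """Weighted sum of the two head losses, with the weights quoted in the text."""
    reg = msle(y_true, y_pred)
    cls = categorical_crossentropy(onehot_true, probs)
    return w_reg * reg + w_cls * cls
```
]]></preformat>
      <p>With a weight of 0.7, the classification term contributes less than the regression term for equal loss values, so the regression head dominates the gradient when both losses are of similar magnitude.</p>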
    </sec>
    <sec id="sec-4">
      <title>RESULTS AND ANALYSIS</title>
      <p>We submitted three different runs, configured as shown in Table 1.
We calibrated the long-term memorability subtask through 10-fold
cross-validation on the development data and used a holdout method
with 10% of the development data for validation to calibrate the
short-term memorability subtask. The evaluation metrics are:
Spearman’s rank correlation, Pearson correlation coefficient, and MSE
(Mean Squared Error). The first is the official metric for the task.</p>
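      <p>The three evaluation metrics can be computed with SciPy and NumPy as in the sketch below. It is an illustration only and not the official evaluation script of the task; the function name and the toy values are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def evaluate(y_true, y_pred):
    """Return Spearman's rank correlation, Pearson correlation, and MSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rho, _ = spearmanr(y_true, y_pred)
    r, _ = pearsonr(y_true, y_pred)
    mse = float(np.mean((y_true - y_pred) ** 2))
    return {"spearman": float(rho), "pearson": float(r), "mse": mse}


# Toy values: the predictions preserve the ranking but not the scale.
metrics = evaluate([0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.5, 0.9])
```
]]></preformat>
      <p>Because Spearman's correlation depends only on the ranking, a predictor can score well on it while still having a large MSE, as in the toy values above.</p>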
      <p>Table 2 presents the results for the development and test sets
considering the long-term memorability subtask. On the development
set, we tested different regression models: SVR with RBF kernel
and KNR. The values tested for the parameter k of
KNR were 5, 20, and 30. Notice that KNR performs better than
SVR on the Spearman and Pearson metrics, and the best results were
achieved by KNR with k = 20. Therefore, we submitted one run for the
long-term memorability subtask based on our best result on the
development set. With it, we achieved a Spearman value of 0.11845
on the test set.
        <fn><label>2</label><p>https://docs.opencv.org/3.4/d7/d8b/tutorial_py_lucas_kanade.html</p></fn>
      </p>
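      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Long-term memorability results.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Set</th><th>Approach</th></tr>
          </thead>
          <tbody>
            <tr><td>Dev. Set</td><td>HMP + SVR kernel RBF</td></tr>
            <tr><td>Dev. Set</td><td>HMP + 5-NN regressor</td></tr>
            <tr><td>Dev. Set</td><td>HMP + 20-NN regressor</td></tr>
            <tr><td>Dev. Set</td><td>HMP + 30-NN regressor</td></tr>
            <tr><td>Test Set</td><td>HMP + 20-NN regressor</td></tr>
          </tbody>
        </table>
      </table-wrap>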
    </sec>
    <sec id="sec-5">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>HMP and C3D differ in an important way: HMP captures the motion dynamics of a video as a whole,
whereas C3D is limited to a short window of fixed duration. In
future work, we intend to analyze whether features encoding long-term
motion dynamics, like HMP or RNN (Recurrent Neural Network),
are better for predicting video memorability than those capturing
short-term motion dynamics, like C3D or ORB.</p>
      <p>We can think of some reasons for the failure of our deep
learning models. First, we constrain a full-length video to a
sequence of 16 consecutive frames. Smarter strategies to capture the
temporal structure of a video, like RNN with LSTM (Long Short-Term
Memory), could lead to improvements. Second, we trained our
deep neural networks from scratch. As the training set is rather
small, data augmentation could be used to improve the results.</p>
      <p>Another promising direction is to combine different features. For
short-term memorability, we fused optical flow and video data. Would
results improve if we fused video (visual data) with the captions (textual
data) provided for the task? Or with other visual features, like HMP?</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>We thank the São Paulo Research Foundation - FAPESP (grant
2016/06441-7), the Brazilian National Council for Scientific and
Technological Development - CNPq (grants 423228/2016-1 and
313122/2017-2) and the Brazilian Federal Agency for
Coordination for the Improvement of Higher Education Personnel - CAPES
(grant 1703269) for funding. We gratefully acknowledge the support
of NVIDIA Corporation with the donation of the Titan Xp GPU
used for this research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Leite</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Comparison of Video Sequences with Histograms of Motion Patterns</article-title>
          .
          <source>In IEEE International Conference on Image Processing (ICIP'11)</source>
          . Brussels, Belgium,
          <fpage>3673</fpage>
          -
          <lpage>3676</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T-T.</given-names>
            <surname>Do</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Rennes</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>MediaEval 2018: Predicting Media Memorability Task</article-title>
          .
          <source>In Proc. of the MediaEval 2018 Workshop</source>
          . Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W-B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>End-to-End Learning of Motion Representation for Video Understanding</article-title>
          .
          <source>In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'18)</source>
          . Salt Lake City, UT, USA,
          <fpage>6016</fpage>
          -
          <lpage>6025</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>VanderPlas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning Spatiotemporal Features with 3D Convolutional Networks</article-title>
          .
          <source>In IEEE International Conference on Computer Vision (ICCV'15)</source>
          . Santiago, Chile,
          <fpage>4489</fpage>
          -
          <lpage>4497</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>