<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RUC at MediaEval 2019: Video Memorability Prediction Based on Visual Textual and Concept Related Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuai Wang</string-name>
          <email>shuaiwang@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Linli Yao</string-name>
          <email>yaolinliruc@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieting Chen</string-name>
          <email>jietingchen1208@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qin Jin</string-name>
          <email>qjin@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information, Renmin University of China</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>The memorability of videos is of great value in applications such as education systems, advertising design and media recommendation. Automatic memorability prediction can make people's daily lives more convenient and bring profit to companies. In this paper, we present our approaches to the Predicting Media Memorability Task at MediaEval 2019. We explored visual, textual and manually designed concept-related features in regression models to predict the memorability of videos.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The MediaEval 2019 Predicting Media Memorability Task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims
to find out what type of video is memorable, namely how likely it is
that a video will be remembered after people watch it. This
problem has a wide range of applications, such as video retrieval
and recommendation, advertising design and education systems.
We explored visual, textual and manually designed concept-related
features in regression models to predict the memorability
of videos.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>We concentrate on visual features extracted from the videos
and textual features drawn from the provided textual metadata. Among
the visual features, we consider both the visual information within a frame
and the temporal relations between successive frames. In addition,
we use deep networks to extract high-level semantic feature
representations. We normalize each extracted feature individually
and then perform feature fusion to improve
performance. Finally, we use two simple but efficient
regressors, Support Vector Regression (SVR) and Random
Forest Regression (RFR), to obtain the final memorability scores.</p>
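      <p>The normalization and fusion steps described above can be sketched as follows. This is a minimal illustration, not the exact procedure used in our experiments: z-score normalization is an assumption (the normalization scheme is not specified here), and the function names and toy values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def zscore_columns(rows):
    """Standardise each column of a feature matrix to zero mean and unit variance."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / n
        # Constant columns have zero spread; fall back to 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]


def early_fusion(*feature_sets):
    """Normalise each feature set on its own, then concatenate them per video."""
    normalised = [zscore_columns(fs) for fs in feature_sets]
    num_videos = len(feature_sets[0])
    return [sum((fs[i] for fs in normalised), []) for i in range(num_videos)]


# Toy data (made-up values): 3 videos, a 2-d "visual" and a 1-d "textual" feature.
visual = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
textual = [[0.5], [0.5], [0.5]]
fused = early_fusion(visual, textual)
print(fused)
```
]]></preformat>
      <p>The fused vectors could then be passed to any SVR or random forest implementation, for example the SVR and RandomForestRegressor classes of scikit-learn.</p>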
    </sec>
    <sec id="sec-3">
      <title>Base Features</title>
      <p>
        In addition to the eight video-specific features provided by the
official benchmark, we extract further features that may be
related to video memorability. We extract high-level
representations of the videos with DenseNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ResNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], both pre-trained on
ImageNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Specifically, we extract 11 frames from
each video as input images, and DenseNet169 outputs a
1664-dimensional feature for each frame. We then combine the features of the 11 frames
into a video-level representation in two ways: simply taking
the average, and using a Gated Recurrent Unit (GRU) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which makes
use of temporal information. The process for ResNet152 is similar,
and it outputs 2048-dimensional features.
      </p>
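      <p>The first of the two aggregation strategies, averaging the frame features, can be illustrated with a short sketch. The 4-dimensional toy vectors stand in for the 1664-dimensional DenseNet169 outputs, and the function name is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frame_features)
    # zip(*...) walks the frames dimension by dimension.
    return [sum(col) / n for col in zip(*frame_features)]


# Toy stand-in: 11 frames with 4-d features instead of 1664-d CNN features.
frames = [[float(i), 1.0, 0.0, 2.0 * i] for i in range(11)]
video_vec = mean_pool(frames)
print(video_vec)
```
]]></preformat>
      <p>Averaging discards the order of the frames, which is why a recurrent model such as a GRU is the alternative when temporal information matters.</p>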
      <p>
        The original title of each video briefly summarizes the objects
and events it contains. We try several popular word embedding
models to get textual features from these captions, including GloVe [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
ConceptNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We sum the embeddings of all words
and average each dimension to obtain the representation
of the whole sentence.
      </p>
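      <p>A caption embedding of this kind can be sketched as below. The 3-dimensional vectors are hypothetical stand-ins for pre-trained embeddings, and the hyphen-separated caption format, the handling of out-of-vocabulary words and the all-zero fallback are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def sentence_embedding(caption, word_vectors, dim):
    """Average the vectors of the in-vocabulary words of a caption."""
    tokens = caption.lower().replace("-", " ").split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # Assumption: fall back to an all-zero vector when no word is known.
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


# Hypothetical 3-d vectors standing in for pre-trained GloVe embeddings.
toy_vectors = {
    "dog": [1.0, 0.0, 0.0],
    "running": [0.0, 1.0, 0.0],
    "beach": [0.0, 0.0, 1.0],
}
emb = sentence_embedding("dog-running-on-beach", toy_vectors, dim=3)
print(emb)
```
]]></preformat>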
    </sec>
    <sec id="sec-4">
      <title>AMNet</title>
      <p>
        When people watch videos, they do not pay equal
attention to every region in the scene; they first focus on a certain
area, which may change over time. We learn from Fajtl
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] that the regions of a still image that quickly attract attention are closely
related to the highly memorable areas. We therefore draw on this
idea and directly apply AMNet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to our task. AMNet is an
end-to-end architecture with a soft attention mechanism and a
Long Short-Term Memory (LSTM) recurrent neural network for
memorability score regression. Moreover, AMNet uses transfer
learning and is evaluated on the LaMem dataset, which effectively
extends the data available for our task. This helps the predicted
memorability scores spread over a wider range, which is much
closer to the distribution of the ground truth.
      </p>
      <p>Specifically, we fine-tune AMNet on the MediaEval
2019 dataset, training the long-term and short-term sub-tasks separately.
Since AMNet is designed for still images, we extract 11
frames at a uniform time interval from each video as input. For
prediction, we take the median memorability score of the 11 frames
as the final result. As shown in Figure 1, the
output attention maps of the video frames are closely related to the
highly memorable visual content in the picture.</p>
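      <p>The frame sampling and score aggregation described above can be sketched as follows. Whether the first and last frames are included in the uniform sampling is an assumption, and the per-frame scores are made-up values standing in for AMNet outputs.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import median


def uniform_frame_indices(total_frames, num_samples=11):
    """Pick frame indices at a uniform interval, including first and last frame."""
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def video_score(frame_scores):
    """Aggregate per-frame memorability scores with the median."""
    return median(frame_scores)


indices = uniform_frame_indices(total_frames=211)
# Made-up per-frame scores standing in for AMNet predictions on 11 frames.
scores = [0.81, 0.79, 0.90, 0.85, 0.84, 0.60, 0.88, 0.86, 0.83, 0.87, 0.82]
final = video_score(scores)
print(indices, final)
```
]]></preformat>
      <p>The median is less sensitive than the mean to a single outlying frame, such as the 0.60 in the toy scores.</p>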
    </sec>
    <sec id="sec-5">
      <title>Concept</title>
      <p>
        People tend to pay different amounts of attention to different
concepts. According to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], most entities can be covered by
7 concepts: animals, building, device, furniture, nature, person, and
vehicle. Among these 7 concepts, animals, person and vehicle are
highly memorable. Inspired by this, we use the 7 concepts to
analyze our caption data. We extract meaningful entities
from the captions by filtering out stop words and keeping nouns.
      </p>
      <sec id="sec-5-4">
        <p>To find out whether this idea makes sense on our data, we count
the number of entities belonging to each concept. We then take the
average of the memorability scores of the videos corresponding to
each concept. The result, shown in Figure 2, indicates that the
preference for certain concepts also affects the memorability of a video
to some extent.</p>
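        <p>The per-concept averaging can be sketched as below. How an entity is assigned to a concept is not detailed here, so the sketch assumes each video already comes with a list of concepts; the data are made-up values.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def mean_score_per_concept(videos):
    """videos: list of (concepts_in_caption, memorability_score) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for concepts, score in videos:
        # set() makes sure a video counts once per concept.
        for concept in set(concepts):
            totals[concept] += score
            counts[concept] += 1
    return {c: totals[c] / counts[c] for c in totals}


# Made-up example: three videos with their concepts and memorability scores.
videos = [
    (["person", "vehicle"], 0.9),
    (["nature"], 0.7),
    (["person"], 0.8),
]
per_concept = mean_score_per_concept(videos)
print(per_concept)
```
]]></preformat>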
        <p>Then, with the help of GloVe word vectors pre-trained on
Common Crawl data, we calculate the distance between each entity and the
above 7 concepts. For each entity, we get a distance vector with
7 elements, which represents the correlation between the entity
and each concept. For each caption, we average the
distance vectors of all the entities the caption contains to
obtain a feature vector. We then apply random forest regression to
the feature vectors. The Spearman score on long-term memorability
prediction is 0.11. This result, based solely on manually designed
textual features, shows that the concepts of the entities in a video are meaningful
for predicting its memorability.</p>
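        <p>The concept-distance feature can be sketched as follows. Cosine distance is an assumption (the distance measure is not specified above), and the toy vectors are hypothetical stand-ins for pre-trained GloVe vectors.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math

CONCEPTS = ["animals", "building", "device", "furniture", "nature", "person", "vehicle"]


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def caption_feature(entities, vectors):
    """Average, over a caption's entities, the distances to each concept."""
    rows = [
        [cosine_distance(vectors[e], vectors[c]) for c in CONCEPTS]
        for e in entities
        if e in vectors
    ]
    if not rows:
        # Assumption: treat a caption without known entities as unrelated to all concepts.
        return [1.0] * len(CONCEPTS)
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical toy vectors: each concept is a one-hot vector in 7 dimensions.
toy = {c: [1.0 if i == j else 0.0 for j in range(7)] for i, c in enumerate(CONCEPTS)}
toy["dog"] = list(toy["animals"])  # pretend "dog" coincides with "animals"
toy["car"] = list(toy["vehicle"])  # pretend "car" coincides with "vehicle"
feature = caption_feature(["dog", "car"], toy)
print(feature)
```
]]></preformat>
        <p>Each caption yields one 7-element vector, which can be fed to a regressor in the same way as any other feature.</p>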
        <p>
          We explored this direction further. Minsky [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] claims that
when people focus and memorize, they pay more attention
to the concepts they are familiar with. Hence we look for lists of
familiar words on Wikipedia and pick the Dolch word list [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
which contains 156 concepts after filtering out certain parts of speech.
Specifically, we replace our 7 concepts with these 156 concepts and
generate the feature vector of each caption. This time we obtain a
Spearman score of 0.15 on long-term memorability. This result
encourages us to fuse concept features into the entire
model.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>RESULTS</title>
      <p>We split the development set into two parts, namely the training set and
the validation set. We train and test on these two sets and choose
the final methods according to their performance on the validation set.
Finally, the models are trained on the whole development set and used to predict
on the official test set. The results on the validation set and the official test
set are shown in Table 3 and Table 2, respectively.</p>
      <p>In Table 1, Table 3 and Table 2, "Base1" means the early fusion of
DenseNet169, GloVe and C3D features, while "Base2" additionally
includes ConceptNet. "Base1" and "Base2" are the best early
fusion strategies on the validation set. "AM" denotes the AMNet scores
mentioned above, and "Dist" denotes the scores from concept distances.
The plus sign means late fusion, for which we empirically apply a set of weights:
"Base * 0.9 + AM * 0.1" and "Base * 0.6 +
Dist * 0.4".</p>
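      <p>The late fusion with these weights amounts to a weighted sum of the per-video scores, as in the following sketch. The function name and the prediction values are made up for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def late_fusion(base_scores, other_scores, base_weight):
    """Weighted sum of two score lists; the two weights add up to 1."""
    other_weight = 1.0 - base_weight
    return [base_weight * b + other_weight * o for b, o in zip(base_scores, other_scores)]


# Made-up predictions for three videos.
base = [0.80, 0.70, 0.90]
am = [0.60, 0.90, 0.90]
dist = [0.50, 0.70, 1.00]

base_plus_am = late_fusion(base, am, base_weight=0.9)      # Base * 0.9 + AM * 0.1
base_plus_dist = late_fusion(base, dist, base_weight=0.6)  # Base * 0.6 + Dist * 0.4
print(base_plus_am, base_plus_dist)
```
]]></preformat>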
      <sec id="sec-6-1">
        <p>Base2: Spearman 0.196, Pearson 0.215, MSE 0.02.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>ANALYSIS AND DISCUSSION</title>
      <p>In our previous experience, deep CNN features and
caption embedding features, such as DenseNet169 and GloVe word embeddings
in our experiments, are the most effective in the memorability
prediction task. We also consider some other
features to study whether they are complementary, and
pick out two combinations as "Base1" and "Base2". Familiar things are easy
for us to remember, and we consider that there is a fuzzy and
a clear way to represent them. AMNet can automatically pay
attention to an object or an area that may attract us; this is a
fuzzy representation, because it does not show the concept directly.
The clear way is the concept distances, which depict the distance
map of the current video. The late fusion of these two methods
with the "base" boosts the performance slightly. We suppose that the
"base", namely CNN features and caption embeddings, is stable,
and that the caption embeddings may already include some
information about these concepts, so the improvement is
not very obvious.</p>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSION</title>
      <p>In conclusion, we design a model that uses visual and textual
representations to predict the memorability scores of given videos.
The results show that deep CNN features and caption word embeddings
are effective, and that the attention information from AMNet and the
semantic distances extracted from captions can boost the performance
slightly. In the future, we will focus on concept
and semantic representations. The interaction between long-term and
short-term ground truth is also an interesting point to explore.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the National Key Research and
Development Plan under Grant No. 2016YFB1001202, the Research Foundation
of Beijing Municipal Science Technology Commission under Grant
No. Z181100008918002 and the National Natural Science Foundation of
China under Grant No. 61772535.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Doha, Qatar,
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          . https://doi.org/10.3115/v1/D14-1179
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mihai Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          , Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Predicting Media Memorability Task at MediaEval 2019</article-title>
          .
          <source>In Proc. of MediaEval 2019 Workshop</source>
          , Sophia Antipolis, France, Oct. 27-29, 2019.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In CVPR09.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . arXiv preprint arXiv:1810.04805
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Edward W</given-names>
            <surname>Dolch</surname>
          </string-name>
          .
          <year>1936</year>
          .
          <article-title>A basic sight vocabulary</article-title>
          .
          <source>The Elementary School Journal</source>
          <volume>36</volume>
          ,
          <issue>6</issue>
          (
          <year>1936</year>
          ),
          <fpage>456</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Rachit</given-names>
            <surname>Dubey</surname>
          </string-name>
          , Joshua Peterson, Aditya Khosla, Ming-Hsuan Yang, and
          <string-name>
            <given-names>Bernard</given-names>
            <surname>Ghanem</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>What makes an object memorable?</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          .
          <fpage>1089</fpage>
          -
          <lpage>1097</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jiri</given-names>
            <surname>Fajtl</surname>
          </string-name>
          , Vasileios Argyriou, Dorothy Monekosso, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Remagnino</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>AMNet: Memorability estimation with attention</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6363</fpage>
          -
          <lpage>6372</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Gao</given-names>
            <surname>Huang</surname>
          </string-name>
          , Zhuang Liu, Laurens Van Der Maaten, and
          <string-name>
            <given-names>Kilian Q</given-names>
            <surname>Weinberger</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Densely connected convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>4700</fpage>
          -
          <lpage>4708</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Marvin</given-names>
            <surname>Minsky</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <source>The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind</source>
          .
          <publisher-name>Simon &amp; Schuster</publisher-name>
          (
          <year>2006</year>
          ),
          <fpage>529</fpage>
          -
          <lpage>551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . http://www.aclweb.org/anthology/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Speer</surname>
          </string-name>
          , Joshua Chin, and
          <string-name>
            <given-names>Catherine</given-names>
            <surname>Havasi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>ConceptNet 5.5: An open multilingual graph of general knowledge</article-title>
          .
          <source>In Thirty-First AAAI Conference on Artificial Intelligence</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>