RUC at MediaEval 2018: Visual and Textual Features Exploration for Predicting Media Memorability

Shuai Wang, Weiying Wang, Shizhe Chen, Qin Jin
Renmin University of China, Beijing, China
shuaiwang@ruc.edu.cn, wy.wang@ruc.edu.cn, cszhe1@ruc.edu.cn, qjin@ruc.edu.cn

Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
Predicting the memorability of videos is valuable in many applications, including content recommendation and advertisement design, and can bring convenience to people in everyday life as well as profit to companies. In this paper, we present our methods for the MediaEval 2018 Predicting Media Memorability Task. We explore deeply-learned visual features and textual features in regression models to predict the memorability of videos.

1 INTRODUCTION
The MediaEval 2018 Predicting Media Memorability Task [4] aims to predict what kind of media is memorable for people, which has a wide range of applications such as video retrieval, video recommendation, advertisement design and education systems. We explore visual and textual representations of videos and build a regression model that computes a memorability score for a given video.

2 APPROACH

2.1 Framework
In general, we use a regressor to predict the memorability score of each video and apply late fusion to combine different features. Two fusion strategies are used, namely score averaging and second-layer regression.

Figure 1: Two strategies of late fusion

Our system framework is shown in Figure 1. We first run a regression on each single feature to obtain per-feature memorability scores for every video. To fuse multiple features, the two strategies shown in Figure 1 are considered. In the score averaging strategy, the scores obtained from the different feature types of the same video are averaged, and the average is taken as the final memorability score. In the second-layer regression strategy, the scores of the different features of the same video are concatenated into a second-layer feature vector, which is fed into a second-layer regressor that predicts the final memorability score.
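As an illustration, the two fusion strategies can be sketched as follows. This is a minimal sketch rather than our exact implementation; the use of scikit-learn's SVR as the second-layer regressor and the array layout are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVR

def score_average(per_feature_scores):
    # per_feature_scores: array of shape (n_videos, n_feature_types), where each
    # column holds the memorability scores predicted from one single feature.
    return per_feature_scores.mean(axis=1)

def second_layer_regression(train_scores, train_labels, test_scores):
    # The concatenated first-layer scores serve as second-layer features for
    # another regressor, which predicts the final memorability scores.
    regressor = SVR()  # regressor choice is illustrative, an assumption
    regressor.fit(train_scores, train_labels)
    return regressor.predict(test_scores)
```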
2.2 Features
The videos are soundless, so we focus on visual and textual features, especially high-level and semantic ones.

The captions of the videos are short, containing only a few words. We argue that people may be impressed by particular objects or their combinations, so the meaning of each word should be embedded into the sentence representation used for memorability prediction. A pre-trained word embedding carries a large amount of semantic information, which helps encode the meaning of a sentence.

We adopt GloVe word embeddings [8] as the textual feature and combine the embeddings of the individual words into a sentence representation in different ways. First, we simply average the word vectors dimension-wise. Second, we use smooth IDF [2] as the weight of each word. Third, we use the pre-trained skip-thought model [7]. Fourth, we use ConceptNet [9]. These four methods yield different types of video-level textual representations.

For visual features, we consider deeply-learned representations and aesthetic descriptors, including C3D [5], HMP [1], I3D [3] and aesthetic features [6]. The C3D, HMP and aesthetic features are officially provided in this task. Furthermore, we extract the I3D-RGB feature, taken from the penultimate layer of the RGB branch of I3D.

Additionally, we add label information. For the long-term task, we first train a model with the short-term labels and use it to predict the short-term scores of the test set. We then transform the short-term labels of both the training and test sets into 10-dimensional one-hot vectors, whose 10 buckets cover the range from 0 to 1 with step 0.1. If a label falls into the range of a bucket, e.g. a label of 0.56 falls into the range 0.5 to 0.6, we set that bucket to 1 and the remaining buckets to 0. The resulting one-hot vector is appended to the end of each text feature. For the short-term task, we map the long-term labels into one-hot vectors and use them in the same way.
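A minimal sketch of this label bucketing is given below. The function name is hypothetical, and clamping a label of exactly 1.0 into the last bucket is an assumption, since that boundary case is not specified above.

```python
import numpy as np

def label_to_one_hot(label, n_buckets=10):
    # Map a memorability label in [0, 1] to a 10-dimensional one-hot vector;
    # bucket i covers [i * 0.1, (i + 1) * 0.1), so a label of 0.56 falls into bucket 5.
    one_hot = np.zeros(n_buckets)
    index = min(int(label * n_buckets), n_buckets - 1)  # clamp label == 1.0 (assumption)
    one_hot[index] = 1.0
    return one_hot

# The one-hot vector is appended to the end of a text feature vector:
# text_feature = np.concatenate([text_feature, label_to_one_hot(0.56)])
```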
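For the caption features described at the beginning of this section, the first two sentence representations (plain averaging and smooth-IDF weighting) can be sketched in a similar spirit. Here glove and idf denote hypothetical lookup tables from a word to its GloVe vector and to a precomputed smooth-IDF weight, and the handling of out-of-vocabulary words is an assumption; the skip-thought and ConceptNet variants rely on pre-trained models and are omitted.

```python
import numpy as np

def average_embedding(words, glove):
    # Sentence representation as the dimension-wise average of the word vectors.
    vectors = [glove[w] for w in words if w in glove]
    return np.mean(vectors, axis=0)

def idf_weighted_embedding(words, glove, idf):
    # Sentence representation as a weighted average, with a smooth-IDF weight per word.
    in_vocab = [w for w in words if w in glove]
    weights = np.array([idf.get(w, 1.0) for w in in_vocab])
    vectors = np.stack([glove[w] for w in in_vocab])
    return weights @ vectors / weights.sum()
```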
3 EXPERIMENTS AND ANALYSIS

3.1 Experimental Setup
The development set and the test set contain 8000 and 2000 videos respectively. To build a local evaluation split, we first rank the videos of the development set by their memorability scores and sample them with a constant step of 4, which splits the development set into two parts: a training set of 6000 videos and a local test set of 2000 videos.

We consider two types of regressors, namely Support Vector Regression (SVR) and Random Forest Regression (RFR). Their parameters are determined by grid search: the penalty parameter C of SVR is searched from 0.125 to 32, and the RFR parameters n_estimators and max_depth are searched in the ranges [100, 1000] with step 100 and [2, 10] with step 2, respectively. The I3D model is pre-trained on ImageNet and Kinetics.
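A minimal sketch of such a grid search with scikit-learn is shown below. The exact C grid (powers of two between 0.125 and 32), the scoring metric and the default cross-validation setting are assumptions, since only the search ranges are stated above; X_train and y_train are hypothetical feature and label arrays.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

svr_search = GridSearchCV(
    SVR(),
    param_grid={"C": [0.125 * 2 ** i for i in range(9)]},  # 0.125, 0.25, ..., 32 (grid assumed)
    scoring="neg_mean_squared_error",
)

rfr_search = GridSearchCV(
    RandomForestRegressor(),
    param_grid={
        "n_estimators": list(range(100, 1001, 100)),  # [100, 1000] with step 100
        "max_depth": list(range(2, 11, 2)),            # [2, 10] with step 2
    },
    scoring="neg_mean_squared_error",
)

# svr_search.fit(X_train, y_train)
# rfr_search.fit(X_train, y_train)
# predictions = svr_search.best_estimator_.predict(X_test)
```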
3.2 Results and Analysis
The results of each single feature for long-term and short-term memorability prediction on the local test set are shown in Figure 2 and Figure 3 respectively.

Figure 2: Results of different features for long-term memorability on the local test set

Figure 3: Results of different features for short-term memorability on the local test set

As shown in Figure 2 and Figure 3, the textual representations perform at a similar level to each other, and the textual features perform better than the visual representations. We think the captions contain clearer descriptions of the elements in the videos: if a specific object is named by a word, the word embedding can describe the relations between this object and the others in the whole environment, whereas the visual features may contain regional details but are less intuitive. When no caption information is available, object detection and classification techniques may offer additional support.

Table 1 and Table 2 show the results on the official test set, where the suffixes m and s denote the score averaging and the second-layer regression strategy respectively; all means fusing all features (visual, textual and labels), visual denotes fusing the visual representations, and text denotes fusing all word embedding features. The required runs use the averaging strategy without the label information; we chose the averaging strategy for them because it generally performed better than second-layer regression on the local test set.

Table 1: Results of different features for long-term memorability on the test set

           all-m    visual-m  text-m   all-s    required
Spearman   0.2374   0.1875    0.2352   0.2404   0.2404
Pearson    0.2584   0.2072    0.2565   0.2621   0.2621
MSE        0.0197   0.0206    0.0198   0.0199   0.0200

Table 2: Results of different features for short-term memorability on the test set

           all-m    visual-m  text-m   all-s    required
Spearman   0.4464   0.3547    0.4383   0.4483   0.4484
Pearson    0.4957   0.3675    0.4881   0.4961   0.4961
MSE        0.0075   0.0108    0.0065   0.0080   0.0082

We notice that the required runs achieve the best performance on both the long-term and the short-term task. The label information helps little on the local test set and does not help on the official test set. We suspect that mapping labels into one-hot vectors is not a proper way to fully exploit the label information, and it is worth finding a better representation of the labels or a better way to fuse them with the other features.

We also picked out a number of videos for analysis and found that some of them depict close-ups of objects or regions, while others show overall scenes such as natural landscapes or stories of some characters. After viewing these videos and their labels, we draw three conclusions:

(1) Videos with low short-term labels usually also have low long-term labels.
(2) Videos with high short-term labels but low long-term labels usually depict close-ups.
(3) Videos with low short-term labels but high long-term labels are rare; they generally show open and wide scenes.

We find it difficult to predict the memorability of the videos in the second and third cases. In sum, we consider that if a video is memorable in the long term, it is generally also memorable in the short term; conversely, a high short-term label does not determine the long-term memorability.

4 CONCLUSION
In conclusion, we explored visual and textual representations of videos and built a regression model that computes a memorability score for a given video. The results show that the textual representations perform better than the visual features. In the future, we will focus on visual semantic representations and object detection related work to find more effective methods for predicting the memorability of videos. How to better use the label information is another interesting point to be explored.

ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202 and partially supported by the National Natural Science Foundation of China (Grant No. 61772535).

REFERENCES
[1] Jurandy Almeida, Neucimar J. Leite, and Ricardo da S. Torres. 2011. Comparison of video sequences with histograms of motion patterns. In IEEE International Conference on Image Processing. 3673–3676.
[2] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In ICLR.
[3] J. Carreira and A. Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4724–4733. https://doi.org/10.1109/CVPR.2017.502
[4] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proc. of the MediaEval 2018 Workshop, 29-31 October 2018, Sophia Antipolis, France.
[5] Du Tran, Lubomir Bourdev, Rob Fergus, and Lorenzo Torresani. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In IEEE International Conference on Computer Vision. 4489–4497.
[6] Andreas F. Haas, Marine Guibert, Anja Foerschner, Tim Co, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A. Sandin, and Jennifer E. Smith. 2015. Can we measure beauty? Computational evaluation of coral reef aesthetics. PeerJ 3, 12 (2015), e1390.
[7] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 3294–3302.
[8] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[9] Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 4444–4451. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972