    RUC at MediaEval 2019: Video Memorability Prediction Based
          on Visual Textual and Concept Related Features
                                                   Shuai Wang, Linli Yao, Jieting Chen, Qin Jin
                                School of Information, Renmin University of China, Beijing, China
                      shuaiwang@ruc.edu.cn,yaolinliruc@gmail.com,jietingchen1208@gmail.com,qjin@ruc.edu.cn

ABSTRACT
The memorability of videos has great value in applications such as education systems, advertising design and media recommendation. Automatic memorability prediction can make people's daily lives more convenient and bring profit to companies. In this paper, we present our approaches to the Predicting Media Memorability Task at MediaEval 2019. We explore visual, textual and hand-designed concept-related features in regression models to predict the memorability of videos.
1     INTRODUCTION
The MediaEval 2019 Predicting Media Memorability Task [2] aims to find out what makes a video memorable, namely how likely it is that a video will be remembered after people have watched it. This problem has a wide range of applications such as video retrieval and recommendation, advertising design and education systems. We explore visual, textual and hand-designed concept-related features in regression models to predict the memorability of videos.
2     APPROACH
Generally, we concentrate on visual features extracted from the videos and textual features drawn from the given textual metadata. Among the visual features, we consider both the visual information within a frame and the temporal factors between successive frames. In addition, we use deep networks to extract high-level semantic feature representations. We normalize each extracted feature and then perform feature fusion to obtain better performance. Finally, we use two simple but efficient regressors, Support Vector Regression (SVR) and Random Forest Regression (RFR), to produce the final memorability scores.
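To make the pipeline concrete, below is a minimal sketch of the regression stage, assuming the per-video feature matrices have already been extracted; the variable names, dimensions and hyperparameters are placeholders rather than our exact settings.

```python
# Minimal sketch of the regression stage: normalize each feature block, fuse by
# concatenation (early fusion), and fit SVR / Random Forest regressors on the
# memorability scores. All data below is random and stands in for real features.
import numpy as np
from scipy.stats import spearmanr
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

def fuse(*feature_blocks):
    """Early fusion: z-score each block, then concatenate along the feature axis."""
    return np.concatenate(
        [StandardScaler().fit_transform(f) for f in feature_blocks], axis=1)

rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(500, 1664))   # e.g. DenseNet169 video features
text_feats = rng.normal(size=(500, 300))      # e.g. averaged GloVe caption embeddings
scores = rng.uniform(0.4, 1.0, size=500)      # memorability ground truth

X = fuse(visual_feats, text_feats)
X_train, X_val = X[:400], X[400:]
y_train, y_val = scores[:400], scores[400:]

for name, reg in [("SVR", SVR(kernel="rbf", C=1.0)),
                  ("RFR", RandomForestRegressor(n_estimators=200, random_state=0))]:
    reg.fit(X_train, y_train)
    rho, _ = spearmanr(y_val, reg.predict(X_val))
    print(f"{name} Spearman: {rho:.4f}")
```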
2.1     Base Features
In addition to the eight video features provided by the official benchmark, we try to extract new features that may be related to video memorability. We extract high-level representations of videos with DenseNet [9] and ResNet [8] pre-trained on ImageNet [3]. Specifically, we extract 11 frames from each video as input images, and DenseNet169 outputs a 1664-dimensional feature for each frame. We then combine the features of the 11 frames into a video-level representation in two ways: simply taking the average, or using a Gated Recurrent Unit (GRU) [1], which makes use of temporal information. The process for ResNet152 is similar, and it outputs 2048-dimensional features.
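The per-frame DenseNet169 features and the two video-level pooling variants can be sketched as follows. This is an illustrative implementation that assumes torchvision 0.13 or later for the weights API, uses a random tensor in place of the 11 preprocessed frames, and picks a GRU hidden size of 512 arbitrarily for the example.

```python
# Sketch: 1664-d DenseNet169 features per frame, then video-level pooling by
# (1) averaging over the 11 frames or (2) a GRU over the frame sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import densenet169, DenseNet169_Weights

model = densenet169(weights=DenseNet169_Weights.IMAGENET1K_V1).eval()

frames = torch.randn(11, 3, 224, 224)            # stand-in for 11 preprocessed frames

with torch.no_grad():
    fmap = model.features(frames)                # (11, 1664, 7, 7)
    frame_feats = F.adaptive_avg_pool2d(F.relu(fmap), 1).flatten(1)  # (11, 1664)

# (1) video-level representation by simple averaging
video_feat_avg = frame_feats.mean(dim=0)         # (1664,)

# (2) video-level representation from the last hidden state of a GRU
gru = nn.GRU(input_size=1664, hidden_size=512, batch_first=True)
_, h_n = gru(frame_feats.unsqueeze(0))           # h_n: (1, 1, 512)
video_feat_gru = h_n.squeeze()                   # (512,)
```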
The original title of each video briefly summarizes the objects and events it contains. We try several popular word embedding models to obtain textual features from these captions, including GloVe [11], ConceptNet [12] and BERT [4]. We sum the embeddings of all words in a caption and average over each dimension to obtain the representation of the whole sentence.
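A minimal sketch of this mean pooling of word embeddings is shown below; the GloVe file name and the whitespace tokenization are illustrative assumptions.

```python
# Sketch: represent a caption as the mean of the GloVe vectors of its words.
import numpy as np

def load_glove(path="glove.840B.300d.txt"):
    """Load GloVe vectors into a dict mapping word -> numpy array."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def caption_embedding(caption, vectors, dim=300):
    """Mean of the embeddings of all in-vocabulary words in the caption."""
    words = [w for w in caption.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)

# usage sketch:
# glove = load_glove()
# feat = caption_embedding("a man is riding a horse on the beach", glove)
```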
Figure 1: Examples of attention maps for high and low memorability images from videos in the dataset of MediaEval 2019. Long-term scores: left picture 1.0, right picture 0.3. Short-term scores: left picture 0.928, right picture 0.898.

2.2    AMNet
We find that when people watch videos, they do not pay equal attention to every region in the scene, but first focus on a certain area, which may change over time. We also learn from the work on AMNet [7] that the still-image regions that quickly attract us are closely related to the highly memorized areas. Therefore, we draw on this idea and directly apply AMNet [7] to our task. AMNet is an end-to-end architecture with a soft attention mechanism and a Long Short-Term Memory (LSTM) recurrent neural network for memorability score regression. Moreover, AMNet uses transfer learning and is trained and evaluated on the LaMem dataset, which effectively extends the data available for our task. This also leads to predicted memorability scores that are scattered over a larger range, which is closer to the distribution of the ground truth.
Specifically, we fine-tune AMNet on the dataset of MediaEval 2019, training the long-term and short-term sub-tasks separately. Considering that AMNet is designed for still images, we extract 11 frames at uniform time intervals from each video as input. For prediction, we take the median memorability score of the 11 frames as the final result. As shown in Figure 1, we can visually observe that the output attention maps of video frames are closely related to the highly memorable visual content in the picture.
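The frame sampling and score aggregation around AMNet can be sketched as follows, assuming OpenCV for video decoding; amnet_predict is a hypothetical callable standing in for the fine-tuned AMNet model.

```python
# Sketch: sample 11 frames at uniform time intervals and take the median of the
# per-frame memorability predictions as the video-level score.
# amnet_predict is a placeholder for the fine-tuned AMNet model.
import numpy as np
import cv2  # OpenCV for video decoding

def sample_uniform_frames(video_path, num_frames=11):
    """Return num_frames frames taken at uniform intervals over the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def video_memorability(video_path, amnet_predict):
    """Median of the per-frame scores predicted by AMNet."""
    frames = sample_uniform_frames(video_path)
    return float(np.median([amnet_predict(f) for f in frames]))
```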
2.3    Concept
Generally, people have a preference for paying attention to different concepts. According to [6], most entities can be covered by 7 concepts: animals, building, device, furniture, nature, person and vehicle. Among these 7 concepts, animals, person and vehicle are highly memorable. Inspired by this, we use the 7 concepts to analyze our caption data. We extracted meaningful entities from the captions by filtering out stop words and keeping nouns.
To find out whether this idea makes sense on our data, we counted the number of entities belonging to each concept and then took the average of the memorability scores of the videos corresponding to these concepts. The result is shown in Figure 2, which indicates that the preference for certain concepts also affects the memorability of a video to some extent.
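The analysis behind Figure 2 can be sketched as follows; the entity-to-concept keyword lists and the input format are illustrative assumptions, not our exact mapping.

```python
# Sketch: average the memorability scores of the videos whose caption entities
# fall under each of the 7 concepts. videos is a placeholder list of
# (noun_entities, memorability_score) pairs; the keyword lists are examples.
from collections import defaultdict

CONCEPT_WORDS = {
    "animal": {"dog", "cat", "horse", "bird"},
    "person": {"man", "woman", "boy", "girl", "person"},
    "vehicle": {"car", "bike", "boat", "train"},
    # ... building, device, furniture, nature
}

def average_score_per_concept(videos):
    scores = defaultdict(list)
    for entities, score in videos:
        for concept, words in CONCEPT_WORDS.items():
            if any(entity in words for entity in entities):
                scores[concept].append(score)
    return {c: sum(s) / len(s) for c, s in scores.items()}

# usage sketch:
# stats = average_score_per_concept([(["man", "horse"], 0.93), (["car"], 0.88)])
```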
Figure 2: The average memorability scores of the 7 concepts based on our caption corpus. Among these 7 concepts, person, animal and nature get higher memorability scores.

Then, with the help of GloVe word vectors pre-trained on Common Crawl data, we calculate the distance between each entity and the above 7 concepts. For each entity, we obtain a distance vector with 7 elements that represents the correlation between the entity and each concept. For each caption, we take the average of the distance vectors of all the entities it contains to get a feature vector, and then apply random forest regression on these feature vectors. The Spearman correlation on long-term memorability prediction is 0.11. This result, based solely on these hand-crafted textual features, shows that the concepts of the entities in a video are meaningful for predicting its memorability.
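The concept-distance feature described above can be sketched as follows; cosine distance in GloVe space is an illustrative choice of distance, and the variable names are placeholders.

```python
# Sketch: 7-element concept-distance vector per entity, averaged over the
# entities of a caption, then regressed to the memorability score with a
# random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

CONCEPTS = ["animal", "building", "device", "furniture", "nature", "person", "vehicle"]

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def caption_concept_feature(entities, glove):
    """Mean over the caption's entities of their 7-d concept-distance vectors."""
    vecs = [[cosine_distance(glove[e], glove[c]) for c in CONCEPTS]
            for e in entities if e in glove]
    if not vecs:
        return np.zeros(len(CONCEPTS), dtype=np.float32)
    return np.asarray(vecs, dtype=np.float32).mean(axis=0)

# usage sketch: X rows are caption features, y the long-term memorability scores
# X = np.stack([caption_concept_feature(ents, glove) for ents in caption_entities])
# reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
```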
We explored this direction further. [10] claims that when people focus and memorize, they pay more attention to the concepts they are familiar with. Hence we looked for familiar word lists on Wikipedia and picked the Dolch word list [5], which contains 156 concepts after filtering out certain parts of speech. Specifically, we replace our 7 concepts with these 156 concepts and generate the feature vector of each caption in the same way. This time we obtained a Spearman correlation of 0.15 on long-term memorability. This promising result encourages us to fuse concept features into the entire model.
3    RESULTS
We split the development set into two parts, namely a training set and a validation set. We train and test on these two sets and choose the final methods according to the performance on the validation set; the chosen models are then trained on the whole development set and used to predict on the official test set. The results on the validation set are shown in Table 1, and the results on the official test set are shown in Tables 2 and 3.
In Tables 1, 2 and 3, "Base1" means the early fusion of DenseNet169, GloVe and C3D features, while "Base2" additionally includes ConceptNet. "Base1" and "Base2" are the best early fusion strategies on the validation set. "AM" denotes the AMNet scores mentioned above, and "Dist" denotes the scores from the concept distances. The plus sign means late fusion, for which we apply a set of empirically chosen weights, namely "Base * 0.9 + AM * 0.1" and "Base * 0.6 + Dist * 0.4".
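This late fusion amounts to an element-wise weighted sum of the per-video score vectors, for example:

```python
# Sketch: fixed-weight late fusion of two sets of predicted scores,
# e.g. Base * 0.9 + AM * 0.1 or Base * 0.6 + Dist * 0.4.
import numpy as np
from scipy.stats import spearmanr

def late_fusion(scores_a, scores_b, w_a, w_b):
    return w_a * np.asarray(scores_a) + w_b * np.asarray(scores_b)

# usage sketch on the validation split (score arrays are placeholders):
# fused = late_fusion(base_scores, amnet_scores, 0.9, 0.1)
# rho, _ = spearmanr(ground_truth, fused)
```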
Table 1: Results of different features for long-term memorability on the validation set

             Base2     AM        Dist      Base2+AM   Base2+Dist
  Spearman   0.2551    0.2116    0.1534    0.2588     0.2587

Table 2: Results of different features for long-term memorability on the official test set

             Base2    Base2+AM   Base1    Base1+AM   Base2+Dist
  Spearman   0.196    0.213      0.198    0.216      0.211
  Pearson    0.215    0.235      0.216    0.236      0.235
  MSE        0.02     0.07       0.02     0.08       0.07

Table 3: Results of different features for short-term memorability on the official test set

             Base2    Base2+AM   Base1    Base1+AM   Base2+Dist
  Spearman   0.436    0.466      0.446    0.472      0.470
  Pearson    0.493    0.520      0.503    0.526      0.523
  MSE        0.01     0.06       0.01     0.06       0.07

4   ANALYSIS AND DISCUSSION
Based on our previous experience, the deep CNN features and caption embedding features, such as DenseNet169 and GloVe word embeddings in our experiments, are the most effective in the memorability prediction task. In addition, we also consider some other features to study whether they are complementary, and pick out two combinations as "Base1" and "Base2". It is easy for us to remember familiar things, so we consider a fuzzy and a clear way of representing them. AMNet can automatically pay attention to an object or an area that may attract us; this is like a fuzzy representation, because it does not show the concept directly. The clear way is the concept distances, which directly depict how close the current video is to each concept. Late fusion of these two methods with the "base" boosts the performance slightly. We suppose that the "base", namely the CNN features and caption embeddings, is stable, and that the caption embeddings may have already captured some information about these concepts, so the improvement is not very pronounced.

5   CONCLUSION
In conclusion, we design a model that uses visual and textual representations to predict the memorability scores of given videos. The results show that deep CNN features and caption word embeddings are effective, and that the attention information from AMNet and the semantic distances extracted from captions can boost the performance slightly. In the future, we will focus on concept and semantic representations. The interaction between the long-term and short-term ground truth is also an interesting point to explore.

ACKNOWLEDGMENTS
This work was supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202, the Research Foundation of Beijing Municipal Science & Technology Commission under Grant No. Z181100008918002 and the National Natural Science Foundation of China under Grant No. 61772535.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France


REFERENCES
 [1] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry
     Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014.
     Learning Phrase Representations using RNN Encoder–Decoder for
     Statistical Machine Translation. In Proceedings of the 2014 Confer-
     ence on Empirical Methods in Natural Language Processing (EMNLP).
     Association for Computational Linguistics, Doha, Qatar, 1724–1734.
     https://doi.org/10.3115/v1/D14-1179
 [2] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty,
     Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019.
     Predicting Media Memorability Task at MediaEval 2019. In Proc. of
     MediaEval 2019 Workshop, Sophia Antipolis, France, Oct. 27-29, 2019
     (2019).
 [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. Ima-
     geNet: A Large-Scale Hierarchical Image Database. In CVPR09.
 [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
     2018. BERT: Pre-training of Deep Bidirectional Transformers for
     Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
 [5] Edward W Dolch. 1936. A basic sight vocabulary. The Elementary
     School Journal 36, 6 (1936), 456–460.
 [6] Rachit Dubey, Joshua Peterson, Aditya Khosla, Ming-Hsuan Yang,
     and Bernard Ghanem. 2015. What makes an object memorable? In
     Proceedings of the IEEE International Conference on Computer Vision.
     1089–1097.
 [7] Jiri Fajtl, Vasileios Argyriou, Dorothy Monekosso, and Paolo Re-
     magnino. 2018. AMNet: Memorability estimation with attention. In
     Proceedings of the IEEE Conference on Computer Vision and Pattern
     Recognition. 6363–6372.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     residual learning for image recognition. In Proceedings of the IEEE
     conference on computer vision and pattern recognition. 770–778.
 [9] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Wein-
     berger. 2017. Densely connected convolutional networks. In Proceed-
     ings of the IEEE conference on computer vision and pattern recognition.
     4700–4708.
[10] Marvin Minsky. 2006. The Emotion Machine: Commonsense Thinking,
     Artificial Intelligence, and the Future of the Human Mind. Simon &
     Schuster (2006), 529–551.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning.
     2014. GloVe: Global Vectors for Word Representation. In Empirical
     Methods in Natural Language Processing (EMNLP). 1532–1543. http:
     //www.aclweb.org/anthology/D14-1162
[12] Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet
     5.5: An open multilingual graph of general knowledge. In Thirty-First
     AAAI Conference on Artificial Intelligence.