=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_33
|storemode=property
|title=RUC at MediaEval 2019: Video Memorability Prediction Based on Visual Textual and Concept Related Features
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_33.pdf
|volume=Vol-2670
|authors=Shuai Wang,Linli Yao,Jieting Chen,Qin Jin
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WangYCJ19
}}
==RUC at MediaEval 2019: Video Memorability Prediction Based on Visual Textual and Concept Related Features==
RUC at MediaEval 2019: Video Memorability Prediction Based on Visual Textual and Concept Related Features

Shuai Wang, Linli Yao, Jieting Chen, Qin Jin
School of Information, Renmin University of China, Beijing, China
shuaiwang@ruc.edu.cn, yaolinliruc@gmail.com, jietingchen1208@gmail.com, qjin@ruc.edu.cn

ABSTRACT
The memorability of videos is valuable in many applications, such as education systems, advertising design and media recommendation, and automatic memorability prediction can make everyday life more convenient and bring companies profit. In this paper, we present our approaches for the Predicting Media Memorability Task at MediaEval 2019. We explore visual, textual and hand-designed concept-related features in regression models to predict the memorability of videos.

Figure 1: Examples of attention maps for high- and low-memorability images from videos in the MediaEval 2019 dataset. Long-term scores: left picture 1.0, right picture 0.3. Short-term scores: left picture 0.928, right picture 0.898.

1 INTRODUCTION
The MediaEval 2019 Predicting Media Memorability Task [2] aims to find out what kind of video is memorable, namely how likely it is that a video will be remembered after people watch it. This problem has a wide range of applications, such as video retrieval and recommendation, advertising design and education systems. We explore visual, textual and hand-designed concept-related features in regression models to predict the memorability of videos.

2 APPROACH
In general, we focus on visual features extracted from the videos and textual features drawn from the given textual metadata. Among the visual features, we consider both the visual information within a frame and the temporal relations between successive frames, and we use deep networks to extract high-level semantic representations. Each extracted feature is normalized individually, and the features are then fused to obtain better performance. Finally, we use two simple but effective regressors, Support Vector Regression (SVR) and Random Forest Regression (RFR), to produce the final memorability scores.

2.1 Base Features
In addition to the eight video features provided by the official benchmark, we extract new features that may be related to video memorability. We extract high-level representations of the videos with DenseNet [9] and ResNet [8] pre-trained on ImageNet [3]. Specifically, we extract 11 frames from each video as input images; DenseNet169 outputs a 1664-dimensional feature per frame, and ResNet152 similarly outputs a 2048-dimensional feature. We then combine the features of the 11 frames into a video-level representation in two ways: simply taking their average, or using a Gated Recurrent Unit (GRU) [1], which exploits temporal information.

The original title of each video briefly summarizes the objects and events it contains. We try several popular word embedding models to obtain textual features from these captions, including GloVe [11], ConceptNet [12] and BERT [4]. We sum the embeddings of the words in a caption and average each dimension to obtain a representation of the whole sentence.
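To make the frame-level pipeline of Section 2.1 concrete, the following is a minimal sketch, assuming PyTorch and torchvision: it extracts the 1664-dimensional DenseNet169 feature of each sampled frame and pools the 11 per-frame features either by simple averaging or with a GRU. The names frames_to_features and GRUPool and the GRU hidden size are our own illustrative choices, not details from the paper; frame sampling, ImageNet normalization and the final SVR/RFR regressor are omitted.

<syntaxhighlight lang="python">
# Illustrative sketch (not the authors' released code): per-frame DenseNet169
# features pooled into a video-level representation by averaging or with a GRU.
import torch
import torch.nn as nn
from torchvision import models

densenet = models.densenet169(pretrained=True)  # ImageNet weights
densenet.eval()

def frames_to_features(frames):
    """frames: tensor of shape (11, 3, 224, 224), already ImageNet-normalized.
    Returns per-frame 1664-d DenseNet169 features of shape (11, 1664)."""
    with torch.no_grad():
        fmap = densenet.features(frames)                       # (11, 1664, 7, 7)
        feats = nn.functional.adaptive_avg_pool2d(fmap, 1).flatten(1)
    return feats                                               # (11, 1664)

def average_pool(feats):
    """Video-level representation by simple averaging over the 11 frames."""
    return feats.mean(dim=0)                                   # (1664,)

class GRUPool(nn.Module):
    """Video-level representation that keeps the temporal order via a GRU."""
    def __init__(self, in_dim=1664, hidden=512):               # hidden size is our guess
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, feats):                                  # feats: (11, 1664)
        _, h = self.gru(feats.unsqueeze(0))                    # h: (1, 1, hidden)
        return h.squeeze()                                     # (hidden,)
</syntaxhighlight>

Either pooled vector can then be fed, after normalization and fusion with the other features, into the SVR or RFR regressor described in Section 2.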
2.2 AMNet
When people watch videos, they do not pay equal attention to every region of the scene; they first focus on a certain area, which may change over time. We also learn from Fajtl et al. [7] that the regions of a still image that quickly attract us are closely related to the regions that are highly memorable. We therefore adopt this idea and directly apply AMNet [7] to our task. AMNet is an end-to-end architecture that combines a soft attention mechanism with a Long Short-Term Memory (LSTM) recurrent network for memorability score regression. Moreover, AMNet uses transfer learning and is trained on the LaMem dataset, which effectively extends the data available for our task; as a result, its predicted memorability scores spread over a wider range, which is closer to the distribution of the ground truth.

Specifically, we fine-tune AMNet on the MediaEval 2019 dataset, training the long-term and short-term sub-tasks separately. Since AMNet is designed for still images, we extract 11 frames at uniform time intervals from each video as input and take the median of the 11 frame-level memorability scores as the final prediction. As Figure 1 shows, the output attention maps of the video frames are closely related to the highly memorable visual content in the picture.

2.3 Concept
People generally prefer to pay attention to different concepts. According to [6], most entities can be covered by 7 concepts: animals, building, device, furniture, nature, person, and vehicle, and among these, animals, person and vehicle are highly memorable. Inspired by this, we analyze our caption data with these 7 concepts. We extract meaningful entities from the captions by filtering out stop words and keeping nouns.

Figure 2: The average memorability scores of the 7 concepts based on our caption corpus. Among these 7 concepts, person, animal and nature get higher memorability scores.

To check whether this idea holds on our data, we count the number of entities belonging to each concept and then average the memorability scores of the videos associated with each concept. The result, shown in Figure 2, indicates that the preference for certain concepts also affects the memorability of a video to some extent.

Then, with GloVe word vectors pre-trained on Common Crawl data, we compute the distance between each entity and the 7 concepts, which gives a 7-element distance vector representing the correlation between the entity and each concept. For each caption, we average the distance vectors of all the entities it contains to obtain a feature vector, and we apply random forest regression to these feature vectors. The resulting Spearman correlation on long-term memorability prediction is 0.11. This result, based solely on hand-designed textual features, shows that the concepts of the entities in a video are meaningful for predicting its memorability.

We explore this direction further. [10] claims that when people focus and memorize, they pay more attention to concepts they are familiar with. We therefore look for familiar word lists on Wikipedia and pick the Dolch word list [5], which contains 156 concepts after filtering out certain parts of speech. Replacing the 7 concepts with these 156 concepts and generating the feature vectors of each caption in the same way yields a Spearman correlation of 0.15 on long-term memorability. This result encourages us to fuse concept features into the entire model.
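The concept-distance features of Section 2.3 can be sketched roughly as follows. This is only an illustration under our own assumptions: the GloVe file path, the exact word forms used for the 7 concepts and the helper names (load_glove, caption_to_concept_vector) are hypothetical, and since the paper does not specify the distance measure, cosine distance is used here as a plausible stand-in.

<syntaxhighlight lang="python">
# Illustrative sketch: map each caption to a 7-d vector of average distances
# between its entity words and the 7 concept words, then fit a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

CONCEPTS = ["animal", "building", "device", "furniture", "nature", "person", "vehicle"]

def load_glove(path="glove.840B.300d.txt", dim=300):
    """Load GloVe vectors from the plain-text format: word v1 ... v_dim."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])          # a few Common Crawl tokens contain spaces
            vecs[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vecs

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def caption_to_concept_vector(entities, glove):
    """Average the per-entity concept-distance vectors over one caption."""
    concept_vecs = [glove[c] for c in CONCEPTS]
    dists = [
        [cosine_distance(glove[e], c) for c in concept_vecs]
        for e in entities if e in glove
    ]
    return np.mean(dists, axis=0) if dists else np.zeros(len(CONCEPTS))

# Usage (hypothetical variable names): caption_entities is a list of noun lists,
# y holds the long-term memorability scores.
# glove = load_glove()
# X = np.stack([caption_to_concept_vector(ents, glove) for ents in caption_entities])
# model = RandomForestRegressor(n_estimators=100).fit(X, y)
</syntaxhighlight>

Swapping the 7 concepts for the 156 Dolch words would only change the CONCEPTS list and the dimensionality of the resulting feature vector.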
3 RESULTS
We split the development set into two parts, a training set and a validation set. We train and evaluate on these two sets and choose the final methods according to the performance on the validation set; the chosen models are then trained on the whole development set and used to predict on the official test set. The results on the validation set and on the official test set are shown in Table 1 and in Tables 2 and 3, respectively.

In Tables 1 to 3, "Base1" denotes the early fusion of DenseNet169, GloVe and C3D features, while "Base2" additionally includes ConceptNet; these two combinations are the best early-fusion strategies on the validation set. "AM" denotes the AMNet scores described above and "Dist" denotes the scores from the concept distances. The plus sign denotes late fusion with empirically chosen weights, namely Base * 0.9 + AM * 0.1 and Base * 0.6 + Dist * 0.4.

Table 1: Results of different features for long-term memorability on the validation set

            Base2    AM       Dist     Base2+AM  Base2+Dist
  Spearman  0.2551   0.2116   0.1534   0.2588    0.2587

Table 2: Results of different features for long-term memorability on the official test set

            Base2   Base2+AM  Base1   Base1+AM  Base2+Dist
  Spearman  0.196   0.213     0.198   0.216     0.211
  Pearson   0.215   0.235     0.216   0.236     0.235
  MSE       0.02    0.07      0.02    0.08      0.07

Table 3: Results of different features for short-term memorability on the official test set

            Base2   Base2+AM  Base1   Base1+AM  Base2+Dist
  Spearman  0.436   0.466     0.446   0.472     0.470
  Pearson   0.493   0.520     0.503   0.526     0.523
  MSE       0.01    0.06      0.01    0.06      0.07
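As a small illustration of the late fusion above, the following sketch blends two aligned per-video score arrays with the empirical weights reported in the paper and evaluates the result with Spearman's rank correlation, the official ranking metric. The toy arrays and variable names are ours; the real pipeline would use the full prediction and ground-truth vectors.

<syntaxhighlight lang="python">
# Illustrative sketch of the weighted late fusion, e.g. Base * 0.9 + AM * 0.1
# or Base * 0.6 + Dist * 0.4, followed by a Spearman evaluation.
import numpy as np
from scipy.stats import spearmanr

def late_fuse(base_scores, other_scores, w_base=0.9, w_other=0.1):
    """Weighted late fusion of two aligned score arrays (one score per video)."""
    base = np.asarray(base_scores, dtype=np.float64)
    other = np.asarray(other_scores, dtype=np.float64)
    return w_base * base + w_other * other

# Toy numbers, not real predictions:
base = np.array([0.85, 0.90, 0.78])     # e.g. "Base2" regressor predictions
amnet = np.array([0.60, 0.95, 0.40])    # e.g. fine-tuned AMNet median scores
fused = late_fuse(base, amnet, w_base=0.9, w_other=0.1)

ground_truth = np.array([0.88, 0.93, 0.75])
rho, _ = spearmanr(fused, ground_truth)  # Spearman rank correlation
</syntaxhighlight>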
4 ANALYSIS AND DISCUSSION
Based on our previous experience, the deep CNN features and the caption embedding features, here DenseNet169 and GloVe word embeddings, are the most effective for the memorability prediction task. In addition, we consider other features to study whether they are complementary, and pick out two combinations, "Base1" and "Base2". Familiar things are easy for us to remember, so we consider a fuzzy and a clear way to represent them. AMNet can automatically attend to an object or area that may attract us; this is a fuzzy representation, because it does not expose the concept directly. The clear way is the concept distances, which describe the concept-distance profile of the current video. The late fusion of these two methods with the "base" boosts the performance slightly. We suppose that the "base", namely the CNN features and caption embeddings, is already stable, and that the caption embeddings may already contain some information about these concepts, so the improvement is not very pronounced.

5 CONCLUSION
We design a model that uses visual and textual representations to predict the memorability scores of given videos. The results show that deep CNN features and caption word embeddings are effective, and that the attention information from AMNet and the semantic distances extracted from the captions can boost the performance slightly. In the future, we will focus on concept and semantic representations. The interaction between the long-term and short-term ground truth is also an interesting point to explore.

ACKNOWLEDGMENTS
This work was supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202, the Research Foundation of Beijing Municipal Science and Technology Commission under Grant No. Z181100008918002, and the National Natural Science Foundation of China under Grant No. 61772535.

REFERENCES
[1] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
[2] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. Predicting Media Memorability Task at MediaEval 2019. In Proc. of the MediaEval 2019 Workshop, Sophia Antipolis, France, Oct. 27-29, 2019.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR 2009.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Edward W. Dolch. 1936. A Basic Sight Vocabulary. The Elementary School Journal 36, 6 (1936), 456–460.
[6] Rachit Dubey, Joshua Peterson, Aditya Khosla, Ming-Hsuan Yang, and Bernard Ghanem. 2015. What Makes an Object Memorable?. In Proceedings of the IEEE International Conference on Computer Vision. 1089–1097.
[7] Jiri Fajtl, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. 2018. AMNet: Memorability Estimation with Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6363–6372.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[9] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[10] Marvin Minsky. 2006. The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster. 529–551.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[12] Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.