=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_39
|storemode=property
|title=Multimodal Deep Features Fusion for Video Memorability Prediction
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_39.pdf
|volume=Vol-2670
|authors=Roberto Leyva,Faiyaz Doctor,Alba G. Seco de Herrera,Sohail Sahab
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LeyvaDHS19
}}
==Multimodal Deep Features Fusion for Video Memorability Prediction==
Roberto Leyva 1,2, Faiyaz Doctor 1, Alba G. Seco de Herrera 1, Sohail Sahab 2
1 University of Essex, Colchester, UK
2 Hub Productions, London, UK
{r.leyva,fdocto,alba.garcia}@essex.ac.uk, sohail@hub.tv

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
This paper describes a multimodal feature fusion approach for predicting short- and long-term video memorability, where the goal is to design a system that automatically predicts scores reflecting the probability of a video being remembered. The approach performs early fusion of text, image, and video features. Text features are extracted using a Convolutional Neural Network (CNN), image features using an FBResNet152 pre-trained on ImageNet, and video features using a 3DResNet152 pre-trained on Kinetics-400. We use Fisher Vectors to obtain a single fixed-length vector per video, which avoids the need for a variable-length representation when handling temporal information. The fusion approach demonstrates good predictive performance and outperforms the standard features in terms of rank correlation.

Figure 1: Video memorability prediction pipeline via three-stream media source information. We early fuse text, image and video features to create a memorability score.

1 INTRODUCTION
Remembering videos is a key aspect of advertising, entertainment, and recommendation systems [3]. We are more influenced by videos that remain fresh in our memory, and we subsequently share their contents with others. Creating memorable video content is crucial for generating consumer impact, engaging entertainment, and profitable marketing campaigns. Understanding and predicting memorability as a function of video features is therefore important for computational video analysis tasks. In this work, we propose a method for video memorability prediction [4], keeping in mind that the videos are not necessarily attractive or interesting. Thus, we explore which features provide better regression results. No assumptions are made on the task's structure, and we analyze text, image, and video features in combination to determine their ability to predict long-term and short-term memorability using different machine learning based regression techniques. Our findings show that long- and short-term memorability share the same feature structure, with fusion of features of different types giving better accuracy on the short-term task. These outcomes also leave room for future improvements.

The works that precede this study have addressed the memorability tasks mainly using the provided features or replacing them [2, 6, 7, 12, 25, 26]. The memorability task can be approached using single-source or multi-source feature information to train a regression model. Gupta et al. [7] propose using the image information source with highly regularized linear models to prevent over-fitting, applied to the provided features, Residual Network (ResNet) features, and Dense Network (DenseNet) features. Over-fitting is potentially a primary concern in the memorability task. They use the Least Absolute Shrinkage and Selection Operator (LASSO) [23], Support Vector Regression (SVR), and Elastic Network (ENet) in their experiments. Savii et al. [20] propose using only the video temporal information, passing Convolution 3D (C3D) [24] and Histogram of Motion Patterns (HMP) [1] features to a Deep Neural Network (DNN), where the final score is obtained using a DNN + k-Nearest Neighbour (k-NN) regressor. In similar work, Tran-Van et al. [25] propose a solution to capture the temporal information by combining Inception V3 (IV3) image features with a Long Short-Term Memory (LSTM) network to produce the memorability score.

2 APPROACH
Multi-source feature fusion usually gives improved results over isolated modeling of features, as shown in [6, 7, 12, 25, 26]. Models using image, text, and video features [2, 26] achieve better results when fusing them than when modelling them individually [22].
However, fusing multiple features from the same information source, e.g., the image source, can increase complexity while giving little improvement to the tasks' performance [6]. For instance, Joshi et al. [12] propose using the Memorability Network [13] along with 3D Hue-Saturation-Value (HSV) features [6], colorfulness [10], aesthetics [8], a saliency network [18], C3D [24], and Global Vectors (GloVe) text features [19]. This approach gives little gain over single-feature source selection. For this reason, we deem it appropriate to extract only one feature from each of the following information sources: text, image, and video. Secondly, modeling the spatio-temporal domain via recurrent networks may become computationally costly [25]. Because we are targeting large-scale video analysis, we consider a less complex approach. Thirdly, to generate the memorability score, we explore linear regularized methods and deep learning models. This consideration rests on the assumption that the latter techniques do not necessarily achieve better generalization, as mentioned in [7]. Finally, the performance of the provided features can be improved upon [17]. To this end, we use other feature representations, following authors [20, 26] who employed ConceptNet [21] and skip-thought vectors [15]. We thereby consider other deep learning approaches for feature extraction, giving particular importance to the spatio-temporal domain, as in [20, 25].

Our proposed method uses three primary feature modalities (text, image, and video) for predicting the memorability score; Figure 1 shows the pipeline in detail.

Text Features: We use the provided video captions as input text to a Text Convolutional Neural Network (TCNN). The text is vectorised via tokenization and a 100-dimension word embedding to feed the network, which is trained on the IMDB dataset for sentiment analysis [14]. We use this dataset because the network's high accuracy on this task gave us confidence that the model is adequately trained and can be trusted as a feature generator. We use the last Fully Connected (FC) layer, which concatenates the convolutional embeddings of the input text, as the feature generator. This process results in a 300-dimension feature vector, i.e., 3× the embedding size.
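As a rough illustration (not the exact architecture used), the sketch below builds a Kim-style text CNN whose penultimate layer concatenates three pooled convolution branches of 100 units each, giving the 300-dimension caption feature described above; the vocabulary size, caption length, and kernel sizes are assumed values.

<pre>
# Minimal sketch of the text stream (TCNN): three parallel 1-D convolution
# branches over a 100-dim word embedding, global-max-pooled and concatenated
# into a 300-dim penultimate FC layer that is later reused as the caption
# feature.  Vocabulary size, caption length and kernel sizes are assumptions.
import numpy as np
from tensorflow.keras import layers, Model

VOCAB, MAX_LEN, EMB = 20_000, 50, 100             # assumed sizes

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB, EMB)(tokens)
branches = []
for k in (3, 4, 5):                               # assumed kernel sizes
    x = layers.Conv1D(EMB, k, activation="relu")(emb)
    branches.append(layers.GlobalMaxPooling1D()(x))
features = layers.Concatenate()(branches)         # 3 x 100 = 300 dims
sentiment = layers.Dense(1, activation="sigmoid")(features)

tcnn = Model(tokens, sentiment)
tcnn.compile(optimizer="adam", loss="binary_crossentropy")
# ... fit on the IMDB sentiment dataset, then reuse the 300-dim layer:
text_encoder = Model(tokens, features)
caption_vec = text_encoder.predict(np.zeros((1, MAX_LEN), dtype="int32"))  # (1, 300)
</pre>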
Image Features: We extract the middle frame of each video clip and apply FBResNet152 [11] pre-trained on ImageNet. To this end, we feed the model the middle frame and extract a 1000-dimension feature vector from the last FC layer. We also explored selecting other frames from the sequences without achieving better correlation values.

Video Features: To extract video features, we use 3DResNet152 [9] pre-trained on Kinetics-400. We feed the video sequence to the network to retrieve a 2048-dimension feature vector for every 16 frames. Although in this particular case we may have fixed-length video clips, in practice the number of frames is not fixed, and stacking the produced features may become very computationally complex. Inspired by the work of Girdhar et al. [5], who use Vector of Locally Aggregated Descriptors (VLAD) vectors for action recognition, we follow a similar approach using Fisher Vectors (FV) to address this problem and create a single feature vector for each video sequence. The method is to fit a Gaussian Mixture Model (GMM) to the collection of 16-frame features and project them into a high-dimensional space via soft assignment. As the resulting feature space has a considerably high dimensionality, we reduce it via Principal Component Analysis (PCA), following an FV-GMM-PCA scheme [16]. This last step provides a single feature vector for each video sequence capturing the motion information of the clips.
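As a rough illustration of this aggregation step, the sketch below pools each clip's per-block 3DResNet152 descriptors into a single Fisher Vector and reduces it with PCA. The choice of 64 GMM components (which, with mean and variance gradients over 2048-dimension descriptors, yields a vector of roughly 260k dimensions, consistent with Figure 1) and the normalisation details are assumptions rather than the authors' exact configuration.

<pre>
# Sketch of the FV-GMM-PCA aggregation: one 2048-dim 3DResNet152 feature per
# 16-frame block is pooled into a single Fisher Vector per clip, then reduced
# to 256 dims with PCA.  K = 64 components and the normalisations are assumed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

D, K = 2048, 64                                    # descriptor size / assumed GMM size

def fisher_vector(desc, gmm):
    """Fisher Vector with mean and variance gradients for a diagonal GMM."""
    T = desc.shape[0]
    post = gmm.predict_proba(desc)                 # (T, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)              # (K, D) diagonal std-devs
    parts = []
    for k in range(K):
        diff = (desc - gmm.means_[k]) / sigma[k]   # normalised residuals
        g_mu = (post[:, k, None] * diff).sum(0) / (T * np.sqrt(gmm.weights_[k]))
        g_sig = (post[:, k, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * gmm.weights_[k]))
        parts += [g_mu, g_sig]
    fv = np.concatenate(parts)                     # 2*K*D ~ 262k dims
    fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)       # L2 normalisation

def aggregate(clips):
    """clips: list of (num_blocks, 2048) arrays of 3DResNet152 block features."""
    gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(np.vstack(clips))
    fvs = np.stack([fisher_vector(c, gmm) for c in clips])
    return PCA(n_components=256).fit_transform(fvs)   # one 256-dim vector per video
</pre>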
Feature Fusion: We combine the text, image, and video features via early fusion. Prior to this step, we reduce each feature's dimensionality using PCA with 256 components, aiming for a better feature representation. The vectors are then stacked into a 3 × 256 = 768-dimension vector and fed into the regression model, as Figure 1 illustrates. The last step is to perform the regression using a regularized method. To this end, we use LassoLarsCV [23], whose cross-validation automatically selects the best regression parameters for the final model.
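A minimal sketch of this fusion and regression stage is shown below, using scikit-learn's LassoLarsCV and random placeholder matrices in place of the three extracted feature streams; the split sizes and variable names are illustrative only.

<pre>
# Early fusion (PCA to 256 dims per modality, stacked to 768) followed by a
# cross-validated LassoLars regressor.  Random placeholders stand in for the
# TCNN, FBResNet152 and FV-aggregated video features.
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoLarsCV

def fuse(text_f, image_f, video_f, n_components=256):
    """Early fusion: PCA per modality, then stacking into a 3 x 256 = 768-dim vector."""
    reduced = [PCA(n_components=n_components).fit_transform(f)
               for f in (text_f, image_f, video_f)]
    return np.hstack(reduced)

rng = np.random.default_rng(0)
n = 1000                                             # placeholder number of videos
X = fuse(rng.normal(size=(n, 300)),                  # stand-in for TCNN caption features
         rng.normal(size=(n, 1000)),                 # stand-in for FBResNet152 frame features
         rng.normal(size=(n, 2048)))                 # stand-in for video-stream features
y = rng.uniform(size=n)                              # memorability scores in [0, 1]

model = LassoLarsCV(cv=5).fit(X[:800], y[:800])      # development split
rho = spearmanr(model.predict(X[800:]), y[800:]).correlation
print(f"Spearman's rho on held-out videos: {rho:.3f}")
</pre>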
3 RESULTS AND ANALYSIS
The memorability dataset comprises 10,000 short, soundless videos split into 8,000 videos for the development set and 2,000 videos for the test set [4]. The videos are varied and contain different scene types; some pre-computed content descriptors are also provided. Table 1 shows that our approach performs better on short-term memorability (STM) than on long-term memorability (LTM). We experimentally found that the regression model has a significant impact on the correlation values. This selection requires further analysis in terms of features as well. Perhaps unsupervised models may reveal more about the nature of the tasks.

Table 1: Memorability task evaluation using Spearman's rank correlation for different models.

Task  Run                                Validation  Test
STM   TCNN/FBRN152/3DRN152/LassoLarsCV   0.5149      0.459
STM   TCNN/FBRN152/3DRN152/LassoCV       0.4987      0.463
STM   FBRN152/LassoLarsCV                0.4936      0.445
STM   TCNN/FBRN152/3DRN152/DNN           0.4837      0.436
STM   TCNN/DN201/3DRN152/LassoLarsCV     0.5185      0.467
LTM   TCNN/FBRN152/3DRN152/SVR           0.2394      0.203
LTM   TCNN/FBRN152/3DRN152/LassoCV       0.2321      0.185
LTM   TCNN/FBRN152/3DRN152/DNN           0.2104      0.159
LTM   FBRN152/SVR                        0.2491      0.189
LTM   DN152/SVR                          0.2612      0.196

4 DISCUSSION AND OUTLOOK
From Table 1, we can see that the best regression model is not the same for both tasks: LassoLarsCV achieves the best results for the STM task, while SVR does so for the LTM task. Although the best regression model differs, we achieve the best correlation results for the memorability tasks when fusing all three types of features. It is worth noticing that image-based features achieve the second-best results. Regarding the frame selection criterion, i.e., the middle frame, we observed no significant difference in Spearman's rank correlation when selecting other frames. This aspect may be linked to the short length of the videos; a quick inspection shows a strong visual relationship between the first and the last frame. Longer sequences may require more elaborate temporal analysis, so for practical purposes we prefer to incorporate features designed specifically for video. We also verified the effectiveness of PCA before the early fusion and for individual feature selection, observing an improvement of ca. 4-7% in Spearman's rank correlation; it is therefore good practice to project the features into a lower-dimensional space before feeding the regression model. The proposed method enables us to capture the memorability associated with videos using multimedia features. With this in mind, it is possible to create models for similar tasks in video content for other computer vision applications. The memorability task can then extrapolate multimedia analysis to other case studies, e.g., video summarization, where the scores can be treated as feature weights and where, naturally, the features are not necessarily visual.

ACKNOWLEDGMENTS
This study has been funded through an Innovate UK Knowledge Transfer Partnership between Hub Productions Limited and the School of Computer Science & Electronic Engineering, University of Essex, Partnership No: 11071.

REFERENCES
[1] Jurandy Almeida, Neucimar J Leite, and Ricardo da S Torres. 2011. Comparison of video sequences with histograms of motion patterns. In 2011 18th IEEE International Conference on Image Processing. IEEE, 3673–3676.
[2] Ritwick Chaudhry, Manoj Kilaru, and Sumit Shekhar. 2018. Show and Recall @ MediaEval 2018 ViMemNet: Predicting Video Memorability. Group 1 (2018), G1.
[3] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, and France Rennes. 2018. MediaEval 2018: Predicting Media Memorability Task. CoRR abs/1807.01052 (2018). arXiv:1807.01052 http://arxiv.org/abs/1807.01052
[4] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. Predicting Media Memorability Task at MediaEval 2019. In Proc. of MediaEval 2019 Workshop, Sophia Antipolis, France, Oct. 27-29, 2019.
[5] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. 2017. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 971–980.
[6] Ankit Goyal, Naveen Kumar, Tanaya Guha, and Shrikanth S Narayanan. 2016. A multimodal mixture-of-experts model for dynamic emotion prediction in movies. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2822–2826.
[7] Rohit Gupta and Kush Motwani. 2018. Linear Models for Video Memorability Prediction Using Visual and Semantic Features. In MediaEval.
[8] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision. 1633–1640.
[9] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6546–6555.
[10] David Hasler and Sabine E Suesstrunk. 2003. Measuring colorfulness in natural images. In Human Vision and Electronic Imaging VIII, Vol. 5007. International Society for Optics and Photonics, 87–95.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[12] Tanmayee Joshi, Sarath Sivaprasad, Savita Bhat, and Niranjan Pedanekar. 2018. Multimodal Approach to Predicting Media Memorability. In MediaEval.
[13] Aditya Khosla, Akhil S Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and predicting image memorability at a large scale. In Proceedings of the IEEE International Conference on Computer Vision. 2390–2398.
[14] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[15] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
[16] R. Leyva, V. Sanchez, and C. Li. 2019. Compact and Low-Complexity Binary Feature Descriptor and Fisher Vectors for Video Analytics. IEEE Transactions on Image Processing 28, 12 (Dec 2019), 6169–6184. https://doi.org/10.1109/TIP.2019.2922826
[17] Yang Liu, Zhonglei Gu, and Tobey H Ko. 2018. Learning Memorability Preserving Subspace for Predicting Media Memorability. In MediaEval.
[18] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O'Connor. 2016. Shallow and deep convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 598–606.
[19] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[20] Ricardo Manhães Savii, Samuel Felipe dos Santos, and Jurandy Almeida. 2018. GIBIS at MediaEval 2018: Predicting Media Memorability Task. In MediaEval.
[21] Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
[22] Wensheng Sun and Xu Zhang. 2018. Video Memorability Prediction with Recurrent Neural Networks and Video Titles at the 2018 MediaEval Predicting Media Memorability Task. In MediaEval.
[23] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267–288.
[24] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[25] Duy-Tue Tran-Van, Le-Vu Tran, and Minh-Triet Tran. 2018. Predicting Media Memorability Using Deep Features and Recurrent Network. In MediaEval.
[26] Shuai Wang, Weiying Wang, Shizhe Chen, and Qin Jin. 2018. RUC at MediaEval 2018: Visual and Textual Features Exploration for Predicting Media Memorability. In MediaEval.