=Paper=
{{Paper
|id=Vol-3181/paper18
|storemode=property
|title=Cross-modal Interaction for Video Memorability Prediction
|pdfUrl=https://ceur-ws.org/Vol-3181/paper18.pdf
|volume=Vol-3181
|authors=Youwei Lu,Xiaoyu Wu
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LuW21
}}
==Cross-modal Interaction for Video Memorability Prediction==
Youwei Lu, Xiaoyu Wu
Communication University of China, China
wowanglenageta@sina.com

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

ABSTRACT
It is important to select memorable videos from the huge number of videos available, which can serve other fields such as video summarization, movie production, etc. The Predicting Media Memorability task in MediaEval 2021 focuses on predicting how well a video is remembered. In this paper, we use a text-guided cross-modal interaction approach for the video memorability prediction task. On top of this, we use a late fusion approach to fuse features from multiple modalities and predict the final video memorability scores.

1 INTRODUCTION
The image memorability task is already a relatively mature field, and much work has been proposed to study it [1, 8-10]. However, the video memorability prediction task is a comparatively new task from an artificial intelligence perspective. For images, people may memorize a certain region of the image, which leads to a high memorability score. For videos, people may memorize certain frames, so video memorability prediction is a more complex and difficult task. The Predicting Media Memorability task in the MediaEval 2021 workshop [11] is designed for this purpose, with the aim of investigating how to better assess the degree to which a video is remembered; video memorability scores are used to measure this. Over the past two years, work has been done on the video memorability prediction task in the 2019 [14, 19] and 2020 [12, 13] editions of the task. We examined the advantages and disadvantages of these methods and finally proposed our own method for predicting video memorability scores.

2 RELATED WORK
There are multiple attributes in videos, such as vision and audio, which play important roles in video memorability prediction. Researchers have used different methods to extract the features of multiple modalities to obtain a good feature representation. For example, the authors in [19] extracted features of video frames using a 2D convolutional network, Inception-V3, and used them to compose features of the whole video. The authors in [20] extracted textual features with the GloVe model [17], a common model in the NLP field. Researchers in [12] used a VGGish model to extract audio features. Cross-modal interaction approaches are widely used in the field of computer vision. For example, in [15], textual features are used to enhance the representation of visual features in the image captioning task, with good results. We therefore try to introduce cross-modal interaction methods to the field of video memorability prediction.

3 APPROACH
As described above, visual, textual, and audio information all play an important role in the video memorability prediction task. We therefore carefully considered the feature extraction steps for each modality. At the same time, we argued that since the text was manually annotated based on the video content, there is semantic consistency between the textual and visual content, and since previous studies have shown that textual information plays a role in memorability prediction tasks [3, 18], textual features were used to guide the representation of visual features and the two modalities were allowed to interact. After obtaining the features of the three modalities, each was passed through an MLP to predict a modality-specific video memorability score. Finally, we used an adaptive score fusion strategy to fuse the scores of the three modalities.

3.1 Visual Feature
3D and 2D convolutional neural networks each have their own advantages when dealing with video content: a 3D convolutional neural network takes the temporal dimension of the video into account, while a 2D convolutional neural network has fewer parameters. We use a 3D convolutional neural network, SlowFast [5], to extract features from the video as Global-level features. We also use a ResNet-101 network [6] to extract features from video frames; for each input video, we sample 8 frames evenly. These frame features are fed into a GRU network [2] to capture temporal dependencies, and the GRU outputs are used as Temporal-aware level features. Afterwards, these features are fed into 1D convolutional layers with kernel sizes 2, 3, 4 and 5 to sense visual patterns of different local extents, and the outputs of the 1D convolutions are used as Local-level features. We concatenate the Global, Temporal-aware, and Local level features as the visual features.
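A minimal PyTorch sketch of this visual branch is shown below. It assumes the SlowFast clip feature and the eight ResNet-101 frame features have already been extracted; the hidden size, the use of the final GRU output as the Temporal-aware feature, and the max-pooling over the convolution outputs are illustrative assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Builds the Global-, Temporal-aware- and Local-level visual features.

    Assumes the SlowFast clip feature and the 8 ResNet-101 frame features
    were extracted beforehand; dimensions below are illustrative.
    """
    def __init__(self, frame_dim=2048, hidden_dim=512):
        super().__init__()
        # GRU over the 8 evenly sampled frame features -> Temporal-aware level
        self.gru = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        # 1D convolutions with kernel sizes 2, 3, 4, 5 -> Local level
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_dim, hidden_dim, kernel_size=k) for k in (2, 3, 4, 5)]
        )

    def forward(self, slowfast_feat, frame_feats):
        # slowfast_feat: (B, D_g) Global-level feature from SlowFast
        # frame_feats:   (B, 8, frame_dim) ResNet-101 features of 8 frames
        seq, _ = self.gru(frame_feats)          # (B, 8, hidden_dim)
        temporal = seq[:, -1]                   # last GRU output (assumption)
        x = seq.transpose(1, 2)                 # (B, hidden_dim, 8) for Conv1d
        local = torch.cat(
            [torch.relu(conv(x)).max(dim=2).values for conv in self.convs],
            dim=1,
        )                                       # pool each kernel size, concat
        # Splice Global, Temporal-aware and Local level features
        return torch.cat([slowfast_feat, temporal, local], dim=1)
```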
3.2 Textual Feature
We used the BERT model [4] to extract the textual features of the video. For each text, we first perform tokenization and prefix the text with a [CLS] token. The last-layer representation of the [CLS] token in BERT is used as the feature of the whole text. For videos with multiple texts, we average the features of the individual texts as the textual feature of the video, since these texts are similar to each other.
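A short sketch of this step with the HuggingFace transformers library is given below; the bert-base-uncased checkpoint is an assumption, as the specific pretrained BERT variant is not stated.

```python
import torch
from transformers import BertModel, BertTokenizer

# Minimal sketch of the textual branch: average the last-layer [CLS]
# embeddings of all captions of one video. "bert-base-uncased" is an
# assumed checkpoint, not one stated in the text above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_feature(captions):
    """Return the averaged [CLS] feature over a video's captions."""
    cls_vectors = []
    with torch.no_grad():
        for caption in captions:
            inputs = tokenizer(caption, return_tensors="pt", truncation=True)
            outputs = bert(**inputs)
            # [CLS] is the first position of the last hidden layer
            cls_vectors.append(outputs.last_hidden_state[:, 0])
    return torch.stack(cls_vectors).mean(dim=0)   # (1, 768) textual feature
```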
3.3 Audio Feature
We used the VGGish model [7] to extract audio features. First, we cut each video into non-overlapping 0.96 s segments; each segment was fed into the VGGish network, which produced a 128-D vector. We fed this vector into an MLP and predicted a video memorability score for the segment. We take the median score over the segments as the audio-stream memorability prediction for the video.

3.4 Cross-modal Interaction
With the visual and textual features extracted above, we used the textual features to interact with the visual features. We first cut the visual features into M=8 segments and mapped the visual and textual features into the same semantic space. Afterwards, the mapped textual feature was combined with each visual segment to compute a weight for that segment. We used these weights to compute a weighted sum of the M visual segments and obtain the interacted features. Through this interaction, the visual features exploit their semantic consistency with the textual features to enhance their own expressiveness.
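The sketch below illustrates one way this text-guided weighting can be implemented; the projection size and the dot-product scoring of segments against the text feature are assumptions, since only the mapping into a common semantic space and the weighted sum over the M segments are specified above.

```python
import torch
import torch.nn as nn

class TextGuidedAttention(nn.Module):
    """Weights the M=8 visual segments by their agreement with the text feature.

    The common-space dimension and the dot-product scoring are assumptions;
    the description above only fixes the mapping into a shared semantic space
    and the weighted sum over segments.
    """
    def __init__(self, visual_dim, text_dim, common_dim=256):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, common_dim)  # map visual segments
        self.proj_t = nn.Linear(text_dim, common_dim)    # map textual feature

    def forward(self, visual_segments, text_feat):
        # visual_segments: (B, M, visual_dim) with M = 8
        # text_feat:       (B, text_dim)
        v = self.proj_v(visual_segments)                 # (B, M, common_dim)
        t = self.proj_t(text_feat).unsqueeze(1)          # (B, 1, common_dim)
        scores = (v * t).sum(dim=-1)                     # (B, M) segment/text agreement
        weights = torch.softmax(scores, dim=-1)          # normalise over segments
        # Weighted sum of the M visual segments -> interacted visual feature
        return (weights.unsqueeze(-1) * visual_segments).sum(dim=1)
```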
3.5 Score Fusion
We trained simple MLP networks on the visual, textual, and audio features separately as regressors predicting a video memorability score for each modality. Each MLP is composed of several fully connected layers and non-linear activation functions. Afterwards, an adaptive weight assignment strategy was used to fuse the three scores: we varied the weight of each modality score in steps of 0.05 while ensuring that the weights sum to 1. In this way, we fused the three scores and predicted the final video memorability score.
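A minimal sketch of this adaptive weight search is given below; it enumerates weight triples in steps of 0.05 that sum to 1 and keeps the combination with the best validation rank correlation, where the use of Spearman correlation as the selection criterion is an assumption consistent with the task's RC metric.

```python
import numpy as np
from scipy.stats import spearmanr

def fuse_scores(vis, txt, aud, labels, step=0.05):
    """Grid-search modality weights (in steps of `step`, summing to 1) on a
    validation set and return the fused scores with the best weights.
    Spearman correlation as the selection criterion is an assumption."""
    best = (-1.0, None)
    for w_v in np.arange(0.0, 1.0 + 1e-9, step):
        for w_t in np.arange(0.0, 1.0 - w_v + 1e-9, step):
            w_a = 1.0 - w_v - w_t                 # weights sum to 1
            fused = w_v * vis + w_t * txt + w_a * aud
            rc = spearmanr(fused, labels).correlation
            if rc > best[0]:
                best = (rc, (w_v, w_t, w_a))
    w_v, w_t, w_a = best[1]
    return w_v * vis + w_t * txt + w_a * aud, best[1]
```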
4 RESULTS AND ANALYSIS
In this section, we describe how we used the TRECVid and Memento10k datasets in our experiments and present the results in Table 1 and Table 2, followed by a brief analysis.

Table 1: Results of our method on the TRECVid 2021 dataset (test and validation sets).
Run | test set (RC) | validation set (RC)
short-term w/ dev | 0.113 | 0.330
short-term w/o dev | 0.123 | 0.432
normalized short-term w/ dev | 0.106 | 0.296
normalized short-term w/o dev | 0.132 | 0.462
long-term w/ dev | 0.110 | 0.331
long-term w/o dev | 0.037 | 0.298

Table 1 shows the experimental results of our method on TRECVid 2021. "w/ dev" in the table means that the development set was used, while "w/o dev" means that it was not. The development set was not officially released at the beginning of the competition, so at first we only used the training set to train the model. When the development set was not used, we divided the training set of 590 videos into 479 training videos and 111 validation videos. When the development set was used, we considered it unreasonable to train on only 590 videos while validating on more than 1000, given that the development set contains nearly 1000 videos, so we mixed the training and development sets together and split them into training and validation sets at a ratio of 0.8/0.2. We believe that more data is beneficial to the model. We were surprised to find that, when training a short-term memorability model, the model trained without the development set achieved better performance in terms of both raw and normalized scores, whereas for the long-term memorability model, using the development set improved performance significantly. We do not yet know the reason for this phenomenon. Additionally, in the score fusion stage, the visual feature receives the greatest weight and the textual feature is secondary to it.

Table 2: Results of our method on the Memento10k dataset (test and validation sets).
Run | test set (RC) | validation set (RC)
short-term | 0.628 | 0.642
normalized short-term | 0.649 | 0.655

Table 2 shows the results of our method on the Memento10k dataset. When training on Memento10k, we used the officially published training/validation split. We should also note that we did not use audio features on Memento10k, partly because some of the videos lack audio and partly because the authors of [16] did not use audio features, so we did not use them either. Our model achieves better performance on the Memento10k dataset; we speculate that the larger amount of data allows for better training of the model and mitigates overfitting.

5 DISCUSSION AND OUTLOOK
In this competition, we first extracted features from multiple modalities, then used cross-modal interaction to enhance the representation of the visual features, and finally used late fusion to combine the video memorability scores predicted from the individual modalities into the final video memorability score. In addition, we observed that optical flow was used to predict video memorability scores in several methods, which is one of our future research directions. However, as extracting optical flow is time-consuming and labour-intensive, we did not use optical flow features in this experiment.

ACKNOWLEDGMENTS
This work is supported by the National Natural Science Foundation of China (No. 61801441, No. 61701277, No. 61771288), the National Key R&D Plan of the 13th Five-Year Plan (No. 2017YFC0821601), the cross media intelligence special fund of the Beijing National Research Center for Information Science and Technology (No. BNR2019TD01022), the discipline construction project of the "Beijing top-notch" discipline (Internet information, Communication University of China), and in part by the State Key Laboratory of Media Convergence and Communication, Communication University of China.

REFERENCES
[1] Erdem Akagunduz, Adrian G. Bors, and Karla K. Evans. 2019. Defining image memorability using the visual memory schema. IEEE Trans. Pattern Anal. Mach. Intell. 42, 9 (2019), 2165–2178.
[2] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).
[3] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire-Hélène Demarty. 2018. Annotating, understanding, and predicting long-term video memorability. In Proc. 2018 ACM International Conference on Multimedia Retrieval. 178–186.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. 4171–4186.
[5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast networks for video recognition. In Proc. IEEE/CVF International Conference on Computer Vision. 6202–6211.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, and others. 2017. CNN architectures for large-scale audio classification. In Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 131–135.
[8] Phillip Isola, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2011. What makes an image memorable? In Proc. IEEE CVPR 2011. 145–152.
[9] Peiguang Jing, Yuting Su, Liqiang Nie, Huimin Gu, Jing Liu, and Meng Wang. 2018. A framework of joint low-rank and sparse regression for image memorability prediction. IEEE Trans. Circuits Syst. Video Technol. 29, 5 (2018), 1296–1309.
[10] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and predicting image memorability at a large scale. In Proc. IEEE International Conference on Computer Vision. 2390–2398.
[11] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F. Smeaton, and Lorin Sweeney. 2021. Overview of the MediaEval 2021 Predicting Media Memorability task. In Working Notes Proceedings of the MediaEval 2021 Workshop.
[12] Ricardo Kleinlein, Cristina Luna-Jiménez, Zoraida Callejas, and Fernando Fernández-Martínez. 2020. Predicting media memorability from a multimodal late fusion of self-attention and LSTM models. In Working Notes Proceedings of the MediaEval 2020 Workshop (CEUR Workshop Proceedings).
[13] Phuc H. Le-Khac, Ayush K. Rai, Graham Healy, Alan F. Smeaton, and Noel E. O'Connor. 2020. Investigating memorability of dynamic media. In Working Notes Proceedings of the MediaEval 2020 Workshop (CEUR Workshop Proceedings).
[14] Roberto Leyva, Faiyaz Doctor, A. G. Seco de Herrera, and Sohail Sahab. 2019. Multimodal deep features fusion for video memorability prediction. In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings).
[15] Jonghwan Mun, Minsu Cho, and Bohyung Han. 2017. Text-guided attention model for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[16] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva. 2020. Multimodal memorability: Modeling effects of semantics and decay on video memorability. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI. Springer, 223–240.
[17] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[18] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and recall: Learning what makes videos memorable. In Proc. IEEE International Conference on Computer Vision Workshops. 2730–2739.
[19] Le-Vu Tran, Vinh-Loc Huynh, and Minh-Triet Tran. 2019. Predicting media memorability using deep features with attention and recurrent network. In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings).
[20] Shuai Wang, Linli Yao, Jieting Chen, and Qin Jin. 2019. RUC at MediaEval 2019: Video memorability prediction based on visual, textual and concept related features. In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings).