Cross-modal Interaction for Video Memorability Prediction

Youwei Lu, Xiaoyu Wu
Communication University of China, China
wowanglenageta@sina.com

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT
It is important to be able to select memorable videos from the huge number of videos available, which can serve other fields such as video summarization and movie production. The Predicting Media Memorability task in MediaEval 2021 focuses on predicting how well a video is remembered. In this paper, we use a text-guided cross-modal interaction approach for the video memorability prediction task, in which textual features guide the representation of the visual features. On top of this, we use a late fusion approach to combine the predictions from multiple modalities into the final video memorability scores.

1    INTRODUCTION
Image memorability is already a relatively mature field, and much work has been devoted to it [1, 8–10]. Video memorability prediction, by contrast, is a comparatively new task from an artificial intelligence perspective. For images, people may memorize a particular region of the image, which leads to a high memorability score; for videos, people may memorize particular frames, which makes video memorability prediction a more complex and difficult task. The Predicting Media Memorability task at the MediaEval 2021 workshop [11] is designed for this purpose, with the aim of investigating how to better assess the degree to which a video is remembered; video memorability scores are used to measure this. Work on video memorability prediction has been carried out in the 2019 [14, 19] and 2020 [12, 13] editions of the task. We studied the advantages and disadvantages of these methods and then designed our own method for predicting video memorability scores.

2    RELATED WORK
Videos have multiple attributes, such as vision and audio, which play important roles in video memorability prediction. Researchers have used different methods to extract features from the individual modalities and obtain good feature representations. For example, the authors of [19] extracted video-frame features with a 2D convolutional network, Inception-V3, and used them to compose features of the whole video. The authors of [20] extracted textual features with the GloVe model [17], a model commonly used in the NLP field. The researchers in [12] used a VGGish model [7] to extract audio features.
   Cross-modal interaction approaches are widely used in computer vision. For example, in [15] textual features are used to enhance the representation of visual features in the image captioning task, with good results. We therefore introduce cross-modal interaction to the field of video memorability prediction.

3    APPROACH
As described above, visual, textual, and audio information all play an important role in the video memorability prediction task, so we carefully designed the feature extraction step for each modality. Because the text is manually annotated from the video content, there is semantic consistency between the textual and the visual content, and previous studies have shown that textual information helps memorability prediction [3, 18]. We therefore use the textual features to guide the representation of the visual features, letting the two modalities interact. After obtaining the features of the three modalities, each is passed through its own MLP to predict a modality-specific video memorability score. Finally, we use an adaptive score fusion strategy to fuse the scores of the three modalities.

3.1    Visual Feature
3D and 2D convolutional neural networks each have their own advantages for video content: a 3D network takes the temporal structure of the video into account, while a 2D network has fewer parameters. We use a 3D convolutional network, SlowFast [5], to extract features from the video, which serve as Global-level features. We also use a ResNet-101 network [6] to extract features from video frames: for each input video we sample 8 frames uniformly. These frame features are fed into a GRU network [2] to model the temporal relations between frames, and its outputs serve as Temporal-aware level features. These features are then fed into 1D convolutional layers with kernel sizes 2, 3, 4, and 5 to capture visual patterns at different local scales, and their outputs serve as Local-level features. We concatenate the Global, Temporal-aware, and Local level features to form the visual features.
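
To make the visual branch concrete, the following is a minimal PyTorch sketch of how the three feature levels described above could be combined. The feature dimensions (SlowFast clip features, 2048-D ResNet-101 frame features for 8 frames), the hidden size, and the pooling of the Temporal-aware and Local outputs are our own assumptions for illustration, not the exact configuration used in our runs.

import torch
import torch.nn as nn

class VisualFeature(nn.Module):
    """Combine Global (SlowFast clip), Temporal-aware (GRU over frames),
    and Local (1D convolutions over the GRU outputs) visual features."""
    def __init__(self, clip_dim=2304, frame_dim=2048, hidden=512):
        super().__init__()
        self.gru = nn.GRU(frame_dim, hidden, batch_first=True)
        # one 1D convolution per kernel size (2, 3, 4, 5) over the GRU outputs
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, kernel_size=k, padding=k // 2) for k in (2, 3, 4, 5)]
        )

    def forward(self, clip_feat, frame_feats):
        # clip_feat:   (B, clip_dim)      pre-extracted SlowFast features  -> Global level
        # frame_feats: (B, 8, frame_dim)  pre-extracted ResNet-101 features per sampled frame
        temporal, _ = self.gru(frame_feats)                 # (B, 8, hidden)  -> Temporal-aware level
        x = temporal.transpose(1, 2)                        # (B, hidden, 8) for Conv1d
        local = torch.cat([c(x).amax(dim=2) for c in self.convs], dim=1)  # Local level
        return torch.cat([clip_feat, temporal.mean(dim=1), local], dim=1)

# e.g. VisualFeature()(torch.randn(4, 2304), torch.randn(4, 8, 2048))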


3.2    Textual Feature
We used the BERT model [4] to extract the textual features of a video. Each text is first tokenized and prefixed with a [CLS] token, and the last-layer representation of the [CLS] token in BERT is used as the feature of the whole text. For videos with multiple texts, we average the features of the texts, since these texts are similar, and use the average as the textual feature of the video.
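
The sketch below shows one way to obtain such a [CLS]-based text feature with the Hugging Face transformers library; the bert-base-uncased checkpoint and the simple averaging over captions are illustrative assumptions rather than a record of our exact setup.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_feature(captions):
    """Average the last-layer [CLS] embeddings of all captions of one video."""
    with torch.no_grad():
        batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
        out = bert(**batch)                    # last_hidden_state: (N, L, 768)
        cls = out.last_hidden_state[:, 0]      # the [CLS] token sits at the first position
    return cls.mean(dim=0)                     # one 768-D textual feature per video

# e.g. text_feature(["a man rides a bike", "a person cycling down a road"])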

3.3    Audio Feature
We used the VGGish model [7] to extract audio features. First, we cut each video into non-overlapping 0.96-second segments; each segment was fed into the VGGish network, which produced a 128-D vector. Each of these vectors was fed into an MLP that predicted a video memorability score for the segment, and we take the median of the segment scores as the audio-stream memorability prediction for the video.
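
A minimal sketch of the per-segment regression and median aggregation described above, assuming the 128-D VGGish embeddings (one per 0.96-second segment) have already been extracted; the MLP sizes are illustrative.

import torch
import torch.nn as nn

class AudioRegressor(nn.Module):
    """Predict a memorability score per audio segment, then take the median."""
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, segment_embs):
        # segment_embs: (num_segments, 128) VGGish embeddings of one video
        scores = self.mlp(segment_embs).squeeze(-1)   # (num_segments,) per-segment scores
        return scores.median()                        # audio-stream score of the video

# e.g. AudioRegressor()(torch.randn(10, 128))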

3.4    Cross-modal Interaction
With the visual and textual features extracted above, we let the textual features interact with the visual features. We first split the visual features into M = 8 segments and map the visual and textual features into the same semantic space. The mapped textual feature is then combined with each visual segment to compute a weight for that segment, and the M visual segments are weighted and summed with these weights to obtain the interacted features. Through this interaction, the visual features exploit their semantic consistency with the textual features to enhance the expressiveness of their own representation.

3.5    Score Fusion
We trained simple MLP networks on the visual, textual, and audio features separately as regressors that predict the video memorability score of the respective modality. Each MLP consists of several fully connected layers with non-linear activation functions. An adaptive weight-assignment strategy is then used to fuse the three scores: we vary the weight of each modality score in steps of 0.05 while keeping the total weight equal to 1, and keep the best-performing combination. The weighted sum of the three scores gives the final video memorability score.
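
A sketch of this weight search, assuming the combination is selected by Spearman rank correlation on a validation set (the RC measure reported in Tables 1 and 2); the function name and the use of scipy are illustrative.

import itertools
import numpy as np
from scipy.stats import spearmanr

def fuse_scores(scores, labels, step=0.05):
    """Grid-search modality weights summing to 1, selected by validation Spearman RC.
    scores: dict such as {"visual": [...], "textual": [...], "audio": [...]}."""
    names = list(scores)
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_rc = None, -1.0
    for partial in itertools.product(grid, repeat=len(names) - 1):
        if sum(partial) > 1.0 + 1e-9:
            continue                                   # weights must sum to 1
        weights = dict(zip(names, list(partial) + [1.0 - sum(partial)]))
        fused = sum(weights[n] * np.asarray(scores[n]) for n in names)
        rc = spearmanr(fused, labels).correlation
        if rc > best_rc:
            best_rc, best_w = rc, weights
    return best_w, best_rc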

4    RESULTS AND ANALYSIS
In this section, we describe how we used the TRECVid and Memento10k datasets in our experiments and present the results in Table 1 and Table 2, followed by a brief analysis of the experimental results.
   Table 1 shows the experimental results of our method on TRECVid 2021. In the table, w/ (dev) means that the development set was used, while w/o (dev) means that it was not. The development set was not officially released at the beginning of the competition, so at first we used only the training set to train the model. When the development set was not used, we divided the training set of 590 videos into 479 videos for training and 111 for validation. When the development set was used, we considered it unreasonable to use only 590 videos for training and more than 1000 videos for validation, given that the development set contains nearly 1000 videos; we therefore mixed the training and development sets together and split the combined data into training and validation sets at a ratio of 0.8/0.2, believing that more data would benefit the model. We were surprised to find that for short-term memorability prediction the model trained without the development set achieved better performance, on both raw and normalized scores, whereas for long-term memorability prediction using the development set improved performance significantly. We do not yet know the reason for this phenomenon. Additionally, in the score fusion stage the visual feature receives the largest weight and the textual feature is secondary to it.

Table 1: Results of our method on the TRECVid 2021 dataset validation set and test set

    Run                               test set (RC)   validation set (RC)
    short-term w/ (dev)                   0.113             0.330
    short-term w/o (dev)                  0.123             0.432
    normalized short-term w/ (dev)        0.106             0.296
    normalized short-term w/o (dev)       0.132             0.462
    long-term w/ (dev)                    0.110             0.331
    long-term w/o (dev)                   0.037             0.298

   Table 2 shows the results of our method on the Memento10k dataset. When training on Memento10k, we used the officially published training/validation split. We should also note that we did not use audio features on Memento10k, partly because some of the videos lack audio and partly because the authors of [16] did not use audio features, so we did not use them either. Our model achieves better performance on the Memento10k dataset; we speculate that the reason is that more data allows the model to be trained better and mitigates the effect of overfitting.

Table 2: Results of our method on the Memento10k dataset validation set and test set

    Run                       test set (RC)   validation set (RC)
    short-term                    0.628             0.642
    normalized short-term         0.649             0.655

5    DISCUSSION AND OUTLOOK
In this competition, we first extracted features from multiple modalities, then used cross-modal interaction to enhance the representation of the visual features, and finally used late fusion of the video memorability scores predicted by the individual modalities to obtain the final video memorability scores. In addition, we observed that several methods use optical flow to predict video memorability scores, which is one of our future research directions; however, as optical flow extraction is time-consuming and labour-intensive, we did not use optical flow features in this experiment.

ACKNOWLEDGMENTS
This work is supported by the National Natural Science Foundation of China (No. 61801441, No. 61701277, No. 61771288), the National Key R&D Plan of the 13th Five-Year Plan (No. 2017YFC0821601), the cross-media intelligence special fund of the Beijing National Research Center for Information Science and Technology (No. BNR2019TD01022), the discipline construction project of the “Beijing top-notch” discipline (Internet information of Communication University of China), and in part by the State Key Laboratory of Media Convergence and Communication, Communication University of China.


REFERENCES
[1] Erdem Akagunduz, Adrian G Bors, and Karla K Evans. 2019. Defining image memorability using the visual memory schema. IEEE Trans. Pattern Anal. Mach. Intell. 42, 9 (2019), 2165–2178.
[2] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).
[3] Romain Cohendet, Karthik Yadati, Ngoc QK Duong, and Claire-Hélène Demarty. 2018. Annotating, understanding, and predicting long-term video memorability. In Proc. 2018 ACM International Conference on Multimedia Retrieval. 178–186.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. 4171–4186.
[5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast networks for video recognition. In Proc. IEEE/CVF International Conference on Computer Vision. 6202–6211.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, and others. 2017. CNN architectures for large-scale audio classification. In Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 131–135.
[8] Phillip Isola, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2011. What makes an image memorable?. In Proc. IEEE CVPR 2011. 145–152.
[9] Peiguang Jing, Yuting Su, Liqiang Nie, Huimin Gu, Jing Liu, and Meng Wang. 2018. A framework of joint low-rank and sparse regression for image memorability prediction. IEEE Trans. Circuits Syst. Video Technol. 29, 5 (2018), 1296–1309.
[10] Aditya Khosla, Akhil S Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and predicting image memorability at a large scale. In Proc. IEEE International Conference on Computer Vision. 2390–2398.
[11] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F. Smeaton, and Lorin Sweeney. 2021. Overview of The MediaEval 2021 Predicting Media Memorability Task. In Working Notes Proceedings of the MediaEval 2021 Workshop.
[12] Ricardo Kleinlein, Cristina Luna-Jiménez, Zoraida Callejas, and Fernando Fernández-Martínez. 2020. Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention and LSTM Models. In Working Notes Proceedings of the MediaEval 2020 Workshop (CEUR Workshop Proceedings).
[13] Phuc H Le-Khac, Ayush K Rai, Graham Healy, Alan F Smeaton, and Noel E O’Connor. 2020. Investigating Memorability of Dynamic Media. In Working Notes Proceedings of the MediaEval 2020 Workshop (CEUR Workshop Proceedings).
[14] Roberto Leyva, Faiyaz Doctor, AG Seco de Herrera, and Sohail Sahab. 2019. Multimodal deep features fusion for video memorability prediction. In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings).
[15] Jonghwan Mun, Minsu Cho, and Bohyung Han. 2017. Text-guided attention model for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[16] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva. 2020. Multimodal memorability: Modeling effects of semantics and decay on video memorability. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI. Springer, 223–240.
[17] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[18] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and recall: Learning what makes videos memorable. In Proc. IEEE International Conference on Computer Vision Workshops. 2730–2739.
[19] Le-Vu Tran, Vinh-Loc Huynh, and Minh-Triet Tran. 2019. Predicting Media Memorability Using Deep Features with Attention and Recurrent Network. In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings).
[20] Shuai Wang, Linli Yao, Jieting Chen, and Qin Jin. 2019. RUC at MediaEval 2019: Video Memorability Prediction Based on Visual Textual and Concept Related Features. In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings).