HCMUS at MediaEval2021: Attention-based Hierarchical Fusion Network for Predicting Media Memorability

E-Ro Nguyen (1,3), Hai-Dang Huynh-Lam (1,3), Hai-Dang Nguyen (1,3), Minh-Triet Tran (1,2,3)
(1) University of Science, VNU-HCM, (2) John von Neumann Institute, VNU-HCM, (3) Vietnam National University, Ho Chi Minh City, Vietnam
{nero,nhdang}@selab.hcmus.edu.vn, hlhdang19@apcs.fitus.edu.vn, tmtriet@fit.hcmus.edu.vn

ABSTRACT
Predicting Media Memorability is a task offered by the Benchmarking Initiative for Multimedia Evaluation as part of the MediaEval 2021 Workshop. The task aims at predicting the memorability of visual media, exploring the potential of automated support systems in application areas such as advertisement, recommendation, and education. To approximate the memorability score of a video, we employ an attention-based fusion network with a hierarchical structure that resembles a binary computation tree, where the embedding of the root node is used to compute the final memorability score.

1 INTRODUCTION
The Predicting Media Memorability task at MediaEval 2021 [8] requires participants to automatically predict the probability that a human will remember a given video after a specified time period. The task provides two datasets for training and evaluating our methods: the Memento10k dataset [10] with short-term memorability scores and the TRECVid dataset [2] with both short-term and long-term scores.

The remainder of this paper is organized as follows: Section 2 reviews prior work and concepts related to our approach; Section 3 introduces the proposed architecture and elaborates on the details of our network; Section 4 reports the results of our runs together with the insights that guided our experiments; Section 5 concludes and discusses possible future directions.

Figure 1: Overview of our proposed method AHFNet. Frame features are fused pairwise by weight-sharing fusion modules, conditioned on the video caption (e.g., "a man cracks an egg"), until a single root embedding remains, from which the memorability score is predicted.

2 RELATED WORK
In the Predicting Media Memorability task, participants need to approximate the probability of each video being remembered by humans; the task can therefore be categorized as video regression, with the input being 4D features sampled from each video [5, 7]. Regression and classification on video have long been studied in the literature [1, 13, 14], with many recent advances coming from Transformer-based neural architectures [12] applied to this class of problems [9, 11].

3 APPROACH
Figure 1 shows an overview of our proposed method. Its core is the attention-based hierarchical fusion network (AHFNet), which repeatedly pairs and fuses two consecutive visual features into a higher-level semantic feature with a fusion module, following a binary-tree hierarchy from the leaves to the root node.

3.1 Fusion Module
We propose a Fusion Module that fuses two visual features into one using the attention mechanism, provided the two features are computed by the same method on different inputs. As illustrated in Figure 2, for any two visual features V_1, V_2 ∈ ℝ^{H×W×C}, we first add a positional encoding PE ∈ ℝ^{H×W} to each feature and apply multi-headed self-attention [12], which learns the associations among the targets within each frame:

    SA(V) = MHA(V + PE, V + PE, V + PE)    (1)

where V ∈ {V_1, V_2}. For consistency, V_1 = f(V̂_1) and V_2 = f(V̂_2), where V̂_1, V̂_2 ∈ ℝ^{T_f×H×W×C} are the inputs and f is the computation applied to the subset of frames used to produce V_1 and V_2, respectively.

Cross-attention between the two visual features is then computed before fusing them into a single feature, helping the network learn the relationship between the two:

    CA(X, Y) = MHA(SA(X), SA(Y), SA(Y))    (2)

where X, Y ∈ {V_1, V_2}. A fusion operator then merges the two visual features into a single one:

    V' = Fuse(CA(V_1, V_2), CA(V_2, V_1))    (3)

where Fuse may be any reduction operator. In our approach, we adopt summation as the fusion operator:

    V' = CA(V_1, V_2) + CA(V_2, V_1)    (4)

Figure 2: Illustration of our Fusion module. The inputs V_1 and V_2 receive a positional encoding and pass through self- and cross-attention; the linguistic features L enter through the CMEM component.
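For concreteness, the listing below is a minimal PyTorch sketch of the Fusion Module described by Equations (1)-(4). It is an illustration rather than our exact implementation: the number of attention heads, the learnable per-position-and-channel positional encoding (the paper specifies PE ∈ ℝ^{H×W}), and the 7×7 grid of 2048-dimensional features are assumptions.

```python
# Minimal PyTorch sketch of the Fusion Module in Eqs. (1)-(4).
# The head count, the learnable per-position-and-channel encoding,
# and the 7x7x2048 feature shape are assumptions, not reported values.
import torch
import torch.nn as nn


class FusionModule(nn.Module):
    def __init__(self, dim, height, width, num_heads=8):
        super().__init__()
        # Positional encoding added before self-attention (Eq. 1).
        self.pos_enc = nn.Parameter(torch.zeros(height * width, 1, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads)

    def _sa(self, v):
        # SA(V) = MHA(V + PE, V + PE, V + PE)
        q = v + self.pos_enc
        out, _ = self.self_attn(q, q, q)
        return out

    def forward(self, v1, v2):
        # v1, v2: (H*W, batch, C) sequences of spatial tokens.
        s1, s2 = self._sa(v1), self._sa(v2)
        # CA(X, Y) = MHA(SA(X), SA(Y), SA(Y))  (Eq. 2)
        c12, _ = self.cross_attn(s1, s2, s2)
        c21, _ = self.cross_attn(s2, s1, s1)
        # Summation as the fusion operator (Eq. 4).
        return c12 + c21


# Usage: fuse two 7x7 maps of 2048-d features (flattened to token sequences).
fuse = FusionModule(dim=2048, height=7, width=7)
v1 = torch.randn(49, 2, 2048)  # (H*W, batch, C)
v2 = torch.randn(49, 2, 2048)
print(fuse(v1, v2).shape)      # torch.Size([49, 2, 2048])
```

Using the same cross-attention weights for both directions keeps the sketch symmetric in its two inputs; the paper does not state whether the two directions share parameters.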
3.2 Hierarchical Fusion Network
We extract 8 frames from each video and use them for image-based feature extraction: a ResNet-50 [6] pre-trained on ImageNet extracts a 2048-dimensional feature for each frame. We then pair the frame features as (1, 2), (3, 4), (5, 6), (7, 8) and fuse each pair with a fusion module to obtain a higher-level semantic feature, which halves the number of features. We repeat this process until a single feature remains. This final feature carries the video's high-level information and is used to predict the memorability score.

Note that Figure 1 shows only a shortened version with 4 frames and 2 levels of fusion modules; our actual network uses 8 frames and 3 levels of fusion modules.
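The binary-tree fusion over the eight frame features can be sketched as follows, reusing the FusionModule class from the previous listing. Sharing a single fusion module across the whole tree and mapping the pooled root feature to the score with a linear layer are assumptions; the figure indicates weight sharing between fusion modules, but the exact sharing scheme and prediction head are not specified in the paper.

```python
# Sketch of the 3-level binary-tree fusion over 8 frame features.
# FusionModule is the class from the sketch in Section 3.1; the single
# shared fusion module and the pooled linear regression head are assumptions.
import torch
import torch.nn as nn


class AHFNetSketch(nn.Module):
    def __init__(self, dim=2048, height=7, width=7):
        super().__init__()
        # One fusion module reused at every node of the tree (shared weights).
        self.fusion = FusionModule(dim, height, width)
        self.head = nn.Linear(dim, 1)  # assumed memorability regression head

    def forward(self, frame_feats):
        # frame_feats: list of 8 tensors of shape (H*W, batch, C).
        feats = frame_feats
        while len(feats) > 1:
            # Pair consecutive features (1,2), (3,4), ... and fuse each pair,
            # halving the number of features at every level.
            feats = [self.fusion(feats[i], feats[i + 1])
                     for i in range(0, len(feats), 2)]
        root = feats[0].mean(dim=0)          # pool spatial tokens of the root
        return self.head(root).squeeze(-1)   # predicted memorability score


# Usage with 8 random frame features for a batch of 2 videos.
model = AHFNetSketch()
frames = [torch.randn(49, 2, 2048) for _ in range(8)]
print(model(frames).shape)  # torch.Size([2])
```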
3.3 Cross-Modal With Text Captions
We use a pre-trained BERT [3] to extract linguistic features from each video's text caption. These features are fed into the Fusion Module, where the Cross-Modal Excitation Modulation (CMEM) module [4] uses them to highlight the visual features that match the corresponding linguistic clues. The CMEM module is shown as the violet component in Figure 2.
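As a rough illustration of how caption features could modulate the visual features, the sketch below extracts a sentence embedding with a pre-trained BERT and uses it to gate the visual channels. This is only one plausible form of cross-modal excitation and not the exact CMEM formulation of [4]; the bert-base-uncased checkpoint, the pooled sentence embedding, and the sigmoid gating are all assumptions.

```python
# Hedged sketch of caption conditioning: a BERT sentence embedding gates
# the channels of a visual feature. This is NOT the exact CMEM module of [4];
# the BERT variant, the pooled embedding, and the sigmoid gate are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class CaptionGate(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=768):
        super().__init__()
        self.proj = nn.Linear(text_dim, visual_dim)

    def forward(self, visual, caption_emb):
        # visual: (H*W, batch, C); caption_emb: (batch, text_dim)
        gate = torch.sigmoid(self.proj(caption_emb))  # channel-wise excitation
        return visual * gate.unsqueeze(0)             # highlight matching channels


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

caption = ["a man cracks an egg"]               # example caption from Figure 1
tokens = tokenizer(caption, return_tensors="pt", padding=True)
with torch.no_grad():
    caption_emb = bert(**tokens).pooler_output  # (batch, 768)

caption_gate = CaptionGate()
visual = torch.randn(49, 1, 2048)
print(caption_gate(visual, caption_emb).shape)  # torch.Size([49, 1, 2048])
```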
4 RESULTS AND ANALYSIS
We have two runs for each dataset (TRECVid, Memento10k) and each score type (short-term raw, short-term normalised, long-term raw) in Subtask 01. The first run uses AHFNet without text captions (AHFNetWTC); the second uses the full AHFNet. Tables 1 and 2 show our results on TRECVid and Memento10k, respectively.

    Metric                     Caption   Short Normalised   Short Raw   Long Raw
    Spearman (higher better)   No        0.06               0.066        0.013
                               Yes       0.069              0.101        0.059
    Pearson (higher better)    No        0.085              0.1         -0.023
                               Yes       0.101              0.11         0.067
    MSE (lower better)         No        0.02               0.01         0.04
                               Yes       0.02               0.01         0.06

Table 1: Results of our runs for Subtask 01 on the TRECVid test set (short-term normalised, short-term raw, and long-term raw memorability). "Caption: Yes" denotes the full AHFNet; "No" denotes AHFNetWTC.

    Metric                     Caption   Short Normalised   Short Raw
    Spearman (higher better)   No        0.508              0.516
                               Yes       0.473              0.456
    Pearson (higher better)    No        0.531              0.534
                               Yes       0.476              0.461
    MSE (lower better)         No        0.01               0.01
                               Yes       0.01               0.01

Table 2: Results of our runs for Subtask 01 on the Memento10k test set (short-term normalised and short-term raw memorability).

In our experiments, we observe that the results on the raw short-term scores are almost always better than on the normalised ones for both datasets. AHFNetWTC performs better on the Memento10k test set, whereas on the TRECVid test set the full AHFNet achieves higher results on all metrics and score types. A possible explanation is that our network exploits the TRECVid text captions more effectively than the Memento10k captions.

5 CONCLUSION
This paper describes an attention-based hierarchical fusion network proposed for the MediaEval 2021 Predicting Media Memorability task. Our main contributions are a fusion module that captures the high-level semantics of two consecutive frames, a binary hierarchical structure that fuses the video's features, and the highlighting of visual features by the corresponding text caption. In the future, we plan to tackle this task with additional features, such as the audio track of the videos, and with a more robust feature extractor that can capture high-level features from dynamic videos more effectively.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol (Paul) Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv:1609.08675. https://arxiv.org/pdf/1609.08675v1.pdf
[2] George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan Fiscus, Afzal Godil, Andrew Delgado, Jesse Zhang, Eliot Godard, Lukas Diduch, Alan F. Smeaton, Yvette Graham, Wessel Kraaij, and Georges Quenot. 2020. TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search Retrieval. (2020). arXiv:cs.CV/2009.09984
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805. http://arxiv.org/abs/1810.04805
[4] Zihan Ding, Tianrui Hui, Shaofei Huang, Si Liu, Xuan Luo, Junshi Huang, and Xiaoming Wei. 2021. Progressive Multimodal Interaction Network for Referring Video Object Segmentation. The 3rd Large-scale Video Object Segmentation Challenge, Workshop in conjunction with CVPR 2021 (virtual). (June 2021).
[5] Christoph Feichtenhofer. 2020. X3D: Expanding Architectures for Efficient Video Recognition. (2020). arXiv:cs.CV/2004.04730
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385 (2015).
[7] Hirokatsu Kataoka, Tenga Wakamiya, Kensho Hara, and Yutaka Satoh. 2020. Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs? CoRR abs/2004.04968 (2020). arXiv:2004.04968. https://arxiv.org/abs/2004.04968
[8] Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Hélène Demarty, Graham Healy, Camilo Fosco, Alba García Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F. Smeaton, and Lorin Sweeney. 2021. Overview of The MediaEval 2021 Predicting Media Memorability Task. In Working Notes Proceedings of the MediaEval 2021 Workshop.
[9] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In EMNLP.
[10] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva. 2020. Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability. (2020). arXiv:cs.CV/2009.02568
[11] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A Joint Model for Video and Language Representation Learning. (2019). arXiv:cs.CV/1904.01766
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000-6010.
[13] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2019. Temporal Segment Networks for Action Recognition in Videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 11 (Nov 2019), 2740-2755. https://doi.org/10.1109/TPAMI.2018.2868668
[14] Chao-Yuan Wu, Ross B. Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krahenbuhl. 2020. A Multigrid Method for Efficiently Training Video Models. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 150-159.