Overview of MediaEval 2020 Predicting Media Memorability Task: What Makes a Video Memorable?

Alba G. Seco de Herrera¹, Rukiye Savran Kiziltepe¹, Jon Chamberlain¹, Mihai Gabriel Constantin², Claire-Hélène Demarty³, Faiyaz Doctor¹, Bogdan Ionescu², Alan F. Smeaton⁴
¹ University of Essex, UK
² University Politehnica of Bucharest, Romania
³ InterDigital, R&I, France
⁴ Dublin City University, Ireland
alba.garcia@essex.ac.uk

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, 14-15 December 2020, Online.

ABSTRACT
This paper describes the MediaEval 2020 Predicting Media Memorability task. After first being proposed at MediaEval 2018, the Predicting Media Memorability task is in its 3rd edition this year, as the prediction of short-term and long-term video memorability (VM) remains a challenging task. In 2020, the format remained the same as in previous editions. This year the videos are a subset of the TRECVid 2019 Video-to-Text dataset, containing more action-rich video content than the videos used in the 2019 task. This paper describes the main aspects of the task, including its main characteristics, the video collection, the ground truth dataset, the evaluation metrics and the requirements for participants' run submissions.

1 INTRODUCTION
Media platforms such as social networks, media advertisements, information retrieval and recommendation systems deal with exponentially growing volumes of content. Enhancing the relevance of multimedia in our everyday lives requires new ways to organise, and in particular to retrieve, digital content. Like other video metrics of importance, such as aesthetics or interestingness, memorability can be regarded as a useful criterion when choosing between competing videos. This is even truer for specific use cases such as creating commercials or educational content. Because the impact of different multimedia content, whether images or videos, on human memory is unequal, the capability to predict the memorability of a given piece of video content is of high importance for professionals in the field of advertising. Beyond advertising, other applications, such as film-making, education and content retrieval, may also benefit from this task.

The Predicting Media Memorability task addresses this problem. The task is part of the MediaEval benchmark and, following the success of previous editions [2, 4], creates a common benchmarking protocol and provides a ground truth dataset for short-term and long-term memorability using common definitions.

2 RELATED WORK
The computational understanding of video memorability follows on from the study of image memorability prediction, which has attracted increasing attention since the seminal work of Isola et al. [7]. Models have achieved very good results at predicting image memorability [8, 15], and we have recently started to see techniques such as style transfer used to improve image memorability [13], illustrating that the field has moved from merely measuring memorability to using memorability as an evaluation metric.

In contrast, research on video memorability (VM) from a computer science point of view is still at an early stage. Recent work on video memorability [11] has focused in particular on short-term memorability, and the scarcity of studies on VM can be explained by several factors. Firstly, few publicly available datasets exist for training and testing models, although the VideoMem [12] and Memento10k [11] datasets are recent additions. Secondly, and closely related to the first point, there is no common definition of VM. Regarding modelling, previous attempts at predicting VM [3, 12] have highlighted several features which contribute to the prediction of VM, such as semantic, saliency and colour features, but this work is far from complete, and more effective computational models are still needed to meet the challenge of VM prediction.

The goal of this task is to contribute to the harmonisation and advancement of this emerging multimedia field. Furthermore, in contrast to previous work on image memorability prediction, where memorability was measured a few minutes after memorisation, we propose a dataset with longer-term memorability annotations.
We expect the predictions of models trained on this data to be more representative of long-term memory, which is preferable for numerous applications.

3 TASK DESCRIPTION
The Predicting Media Memorability task requires participants to automatically predict memorability scores for short-form videos, reflecting the probability that a video will be remembered. Participants were provided with a dataset of videos with short-term and long-term memorability annotations, related information, and pre-extracted state-of-the-art visual features. Two subtasks were proposed to participants:

● Short-term VM prediction: scores were measured a few minutes after the memorisation process;
● Long-term VM prediction: scores were measured 24-72 hours after the memorisation process.

In the video memorability game, participants are expected to watch 180 videos in the short-term memorisation step and 120 videos in the long-term step. Their task is to press the space bar whenever they recognise a previously seen video, which makes it possible to determine which videos each participant did and did not recognise. In the first step of the game, 40 target videos are repeated after a few minutes to collect short-term memorability labels. Among the filler videos in the first step, 60 non-vigilance fillers are displayed once, while 20 vigilance fillers are repeated after a few seconds to check participants' attention to the task. Between 24 and 72 hours later, the same participants are expected to attend the second step, which collects long-term memorability labels. This time, 40 target videos chosen randomly from among the non-vigilance fillers of the first step, together with 80 fillers selected randomly from new videos, are displayed to measure long-term memorability scores for those target videos. Both short-term and long-term memorability scores are calculated as the percentage of participants who correctly recognise each video. Relevant screenshots and the label collection procedure are shown on the MediaEval task web page [10].
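To make the scoring rule concrete, the following is a minimal Python sketch of the computation described above. The record format, a list of (video_id, recognised) pairs collected from the repeated presentations, is an assumption made for illustration and is not the task's official annotation format.

```python
from collections import defaultdict

def memorability_scores(responses):
    """Score each video as the percentage of correct recognitions.

    `responses` is assumed to hold one (video_id, recognised) pair per
    second appearance of a target video, where `recognised` is True if
    the participant pressed the space bar. This format is hypothetical,
    chosen only to illustrate the percentage-of-recognitions rule.
    """
    hits = defaultdict(int)    # correct recognitions per video
    shows = defaultdict(int)   # repeated presentations per video
    for video_id, recognised in responses:
        shows[video_id] += 1
        hits[video_id] += int(recognised)
    return {vid: 100.0 * hits[vid] / shows[vid] for vid in shows}

# Three participants saw video "v1" a second time; two recognised it.
print(memorability_scores([("v1", True), ("v1", True), ("v1", False)]))
# -> {'v1': 66.66666666666667}
```

The same rule applies to both subtasks; only the delay between the two appearances of a target video differs.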
4 COLLECTION

Figure 1: A sample of frames of the videos in the TRECVid 2019 Video-to-Text dataset.

The dataset is composed of a subset of short videos selected from the TRECVid 2019 Video-to-Text dataset [1] (see Figure 1). These videos are shared under Creative Commons licenses that allow their redistribution. The TRECVid videos contain much more action than those used in the 2019 VM task, and thus correspond to more generic use cases.

Each video constitutes a coherent unit in terms of meaning and is associated with two memorability scores, which refer to its probability of being remembered after two different durations of memory retention. A set of pre-extracted features is also distributed:

● image-level features: AlexNetFC7 [9], HOG [5], HSVHist, RGBHist, LBP [6], VGGFC7 [14];
● video-level feature: C3D [16].

The image-level features were extracted from three frames of each video: the first, the middle and the last. In addition, each TRECVid video is accompanied by two textual captions describing the activity it shows. Further information about the annotation process was also provided to allow deeper investigation of user interaction with memorability: the annotations collected from participants include the first and second appearance positions of each target video, together with the user's response time and the key pressed while watching each video.

The TRECVid 2019 Video-to-Text dataset [1] contains 6,000 videos. In 2020, three subsets were distributed as part of the MediaEval Predicting Media Memorability task: a training set of 590 videos, a development set of 410 videos and a test set of 500 videos. Each video was annotated by at least 16 annotators for short-term memorability; the long-term annotations are fewer.

Similar to previous editions of the task [2, 4], memorability was measured using recognition tests, i.e., through an objective measure, a few minutes after memorisation of the videos (short term) and again 24 to 72 hours later (long term). The ground truth dataset was collected using the video memorability game protocol proposed by Cohendet et al. [3]. Two versions of the memorability game were published: one on Amazon Mechanical Turk (AMT) and another issued for general use with three language options: English, Spanish and Turkish.
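As a usage illustration, the sketch below aggregates the pre-extracted image-level features described above into one descriptor per video by averaging the first, middle and last frame vectors. The CSV layout, one row per frame with a video identifier followed by feature values, is an assumption made for illustration; the features actually distributed with the task come in their own formats.

```python
import csv
import numpy as np

def load_video_descriptors(csv_path):
    """Average per-frame feature vectors into one descriptor per video.

    Assumes a CSV with a video identifier in the first column and the
    feature values for one frame in the remaining columns, so that each
    video contributes three rows (first, middle and last frame). This
    layout is hypothetical; adapt the parsing to the actual feature
    files distributed with the task.
    """
    per_video = {}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id = row[0]
            frame_vector = np.asarray(row[1:], dtype=float)
            per_video.setdefault(video_id, []).append(frame_vector)
    # One fixed-length descriptor per video, e.g. as regression input.
    return {vid: np.mean(frames, axis=0) for vid, frames in per_video.items()}
```

Averaging is only one plausible aggregation; concatenating the three frame vectors, or using the video-level C3D feature directly, are equally simple starting points.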
5 SUBMISSION AND EVALUATION
As in previous editions of the task, each team is required to predict both short-term and long-term memorability. In total, 10 runs can be submitted, 5 for each subtask. For the two required runs, all information can be used in the development of the system: the provided features, the ground truth data, the video sample titles, features extracted from the visual content, and even external data. The only exception is that the required short-term and long-term memorability runs must not use each other's score annotations. For the remaining runs, a maximum of 4 per subtask, everything is permitted, including the use of cross-annotations between the subtasks.

The outputs of the prediction models, i.e., the predicted memorability scores for the videos, will be compared with the ground truth memorability scores using classic evaluation metrics (e.g., Spearman's rank correlation).
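For reference, Spearman's rank correlation between predicted and ground truth scores can be computed with SciPy. The scores below are made-up toy values; the only assumption is that the two lists are aligned, i.e., indexed by the same videos in the same order.

```python
from scipy.stats import spearmanr

# Toy, made-up scores for five videos, aligned by position.
ground_truth = [0.92, 0.80, 0.75, 0.88, 0.61]
predicted = [0.90, 0.70, 0.78, 0.85, 0.64]

# Spearman's rho correlates the ranks of the two lists, so it rewards
# predicting the correct ordering of videos rather than exact values.
rho, p_value = spearmanr(ground_truth, predicted)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```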
6 DISCUSSION AND OUTLOOK
In this paper we presented the third edition of the Predicting Media Memorability task at the MediaEval 2020 benchmarking initiative. The task provides a framework for the comparative study of different state-of-the-art machine learning approaches aiming to predict short-term and long-term memorability. A collection of videos is provided, together with memorability annotations and a common evaluation metric. In addition, related information has been provided to help participants develop their approaches. Details of the methods employed by participants and their results can be found in the proceedings of the 2020 MediaEval workshop¹.

ACKNOWLEDGMENTS
This work was part-funded by NIST Award No. 60NANB19D155, by Science Foundation Ireland under grant number SFI/12/RC/2289_P2, and by the project AI4Media, A European Excellence Centre for Media, Society and Democracy, H2020 ICT-48-2020, grant 951911.

¹ See CEUR Workshop Proceedings (CEUR-WS.org).

REFERENCES
[1] George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan Fiscus, Afzal Godil, Andrew Delgado, Jesse Zhang, Eliot Godard, Lukas Diduch, and others. 2019. TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval. (2019).
[2] Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Working Notes Proceedings of the MediaEval 2018 Workshop. Sophia Antipolis, France.
[3] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, and Martin Engilberge. 2019. VideoMem: Constructing, Analyzing, Predicting Short-term and Long-term Video Memorability. In Proceedings of the IEEE International Conference on Computer Vision. 2531–2540.
[4] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. Predicting Media Memorability Task at MediaEval 2019. In Working Notes Proceedings of the MediaEval 2019 Workshop. Sophia Antipolis, France.
[5] Navneet Dalal and Bill Triggs. 2005. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1. IEEE, 886–893.
[6] Dong-Chen He and Li Wang. 1990. Texture Unit, Texture Spectrum, and Texture Analysis. IEEE Transactions on Geoscience and Remote Sensing 28, 4 (1990), 509–512.
[7] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2013. What Makes a Photograph Memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1469–1482.
[8] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In Proceedings of the IEEE International Conference on Computer Vision. 2390–2398.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. 1097–1105.
[10] MediaEval. 2020. MediaEval 2020: Predicting Media Memorability. (2020). https://multimediaeval.github.io/editions/2020/tasks/memorability/ Accessed: 2020-11-26.
[11] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva. 2020. Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability. In Computer Vision – ECCV 2020, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer International Publishing, Cham, 223–240.
[12] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 2730–2739.
[13] Aliaksandr Siarohin, Gloria Zen, Cveta Majtanovic, Xavier Alameda-Pineda, Elisa Ricci, and Nicu Sebe. 2019. Increasing Image Memorability with Neural Style Transfer. ACM Trans. Multimedia Comput. Commun. Appl. 15, 2, Article 42 (June 2019), 22 pages. https://doi.org/10.1145/3311781
[14] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
[15] Hammad Squalli-Houssaini, Ngoc Q. K. Duong, Marquant Gwenaëlle, and Claire-Hélène Demarty. 2018. Deep Learning for Predicting Image Memorability. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2371–2375.
[16] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.