Overview of MediaEval 2020 Predicting Media Memorability Task: What Makes a Video Memorable?

Alba G. Seco de Herrera¹, Rukiye Savran Kiziltepe¹, Jon Chamberlain¹, Mihai Gabriel Constantin², Claire-Hélène Demarty³, Faiyaz Doctor¹, Bogdan Ionescu², Alan F. Smeaton⁴
¹ University of Essex, UK
² University Politehnica of Bucharest, Romania
³ InterDigital, R&I, France
⁴ Dublin City University, Ireland
alba.garcia@essex.ac.uk

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, 14-15 December 2020, Online.

ABSTRACT
This paper describes the MediaEval 2020 Predicting Media Memorability task. After first being proposed at MediaEval 2018, the Predicting Media Memorability task is in its 3rd edition this year, as the prediction of short-term and long-term video memorability (VM) remains a challenging task. In 2020, the format remained the same as in previous editions. This year the videos are a subset of the TRECVid 2019 Video-to-Text dataset, containing more action-rich video content than the videos used in the 2019 task. This paper describes the main aspects of the task, including its main characteristics, the video collection, the ground truth dataset, the evaluation metrics and the requirements for participants' run submissions.

1 INTRODUCTION
Media platforms such as social networks, media advertisements, information retrieval and recommendation systems deal with exponentially growing volumes of content. Enhancing the relevance of multimedia in our everyday lives requires new ways to organise, and in particular to retrieve, digital content. Like other video metrics of importance, such as aesthetics or interestingness, memorability can be regarded as a useful criterion when choosing between competing videos. This is even truer for specific use cases such as creating commercials or educational content. Because the impact of different multimedia content, whether images or videos, on human memory is unequal, the capability to predict the memorability of a given piece of video content is of high importance for professionals in the field of advertising. Beyond advertising, other applications, such as film-making, education and content retrieval, may also benefit from this task.

The Predicting Media Memorability task addresses this problem. The task is part of the MediaEval benchmark and, following the success of previous editions [2, 4], creates a common benchmarking protocol and provides a ground truth dataset for short-term and long-term memorability using common definitions.

2 RELATED WORK
The computational understanding of video memorability follows on from the study of image memorability prediction, which has attracted increasing attention since the seminal work of Isola et al. [7]. Models have achieved very good results at predicting image memorability [8, 15], and we have recently started to see techniques such as style transfer used to improve image memorability [13], illustrating that the field has moved from merely measuring memorability to using memorability as an evaluation metric.

In contrast, research on video memorability (VM) from a computer science point of view is still at an early stage. Recent work on video memorability [11] has focused in particular on short-term memorability, and the scarcity of studies on VM can be explained by several factors. Firstly, few publicly available datasets exist for training and testing models, although the VideoMem [12] and Memento10k [11] datasets are recent additions. Secondly, and closely related to the first point, there is no common definition of VM. Regarding modelling, previous attempts at predicting VM [3, 12] have highlighted several features which contribute to the prediction of VM, such as semantic, saliency and colour features, but this work is far from complete, and more effective computational models are still needed to meet the challenge of VM prediction.

The goal of this task is to contribute to the harmonisation and advancement of this emerging multimedia field. Furthermore, in contrast to previous work on image memorability prediction, where memorability was measured a few minutes after memorisation, we propose a dataset with longer-term memorability annotations.
We expect the predictions of models trained on this data to be more representative of long-term memory, which is preferable for numerous applications.

3 TASK DESCRIPTION
The Predicting Media Memorability task requires participants to automatically predict memorability scores for short-form videos, reflecting the probability that a video will be remembered. Participants were provided with a dataset of videos with short-term and long-term memorability annotations, related information, and pre-extracted state-of-the-art visual features. Two subtasks were proposed to participants:

● Short-term VM prediction: scores were measured a few minutes after the memorisation process;
● Long-term VM prediction: scores were measured 24-72 hours after the memorisation process.

In the video memorability game, participants are expected to watch 180 videos in the short-term memorisation step and 120 videos in the long-term step. Their task is to press the space bar whenever they recognise a previously seen video, which makes it possible to determine which videos each participant did and did not recognise. In the first step of the game, 40 target videos are repeated after a few minutes to collect short-term memorability labels. Among the filler videos in the first step, 60 non-vigilance fillers are displayed once, while 20 vigilance fillers are repeated after a few seconds to check participants' attention to the task. Between 24 and 72 hours later, the same participants are expected to attend the second step, which collects long-term memorability labels. This time, 40 target videos chosen randomly from among the non-vigilance fillers of the first step, together with 80 fillers selected randomly from new videos, are displayed to measure long-term memorability scores for those target videos. Both short-term and long-term memorability scores are calculated as the percentage of participants who correctly recognise each video. Relevant screenshots and the label collection procedure are shown on the MediaEval task web page [10].
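To make the scoring rule concrete, the following is a minimal Python sketch of the computation described above. The record format, a list of (video_id, recognised) pairs collected from the repeated presentations, is an assumption made for illustration and is not the task's official annotation format.

```python
from collections import defaultdict

def memorability_scores(responses):
    """Score each video as the percentage of correct recognitions.

    `responses` is assumed to hold one (video_id, recognised) pair per
    second appearance of a target video, where `recognised` is True if
    the participant pressed the space bar. This format is hypothetical,
    chosen only to illustrate the percentage-of-recognitions rule.
    """
    hits = defaultdict(int)    # correct recognitions per video
    shows = defaultdict(int)   # repeated presentations per video
    for video_id, recognised in responses:
        shows[video_id] += 1
        hits[video_id] += int(recognised)
    return {vid: 100.0 * hits[vid] / shows[vid] for vid in shows}

# Three participants saw video "v1" a second time; two recognised it.
print(memorability_scores([("v1", True), ("v1", True), ("v1", False)]))
# -> {'v1': 66.66666666666667}
```

The same rule applies to both subtasks; only the delay between the two appearances of a target video differs.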
4 COLLECTION

Figure 1: A sample of frames of the videos in the TRECVid 2019 Video-to-Text dataset.

The dataset is composed of a subset of short videos selected from the TRECVid 2019 Video-to-Text dataset [1] (see Figure 1). These videos are shared under Creative Commons licenses that allow their redistribution. The TRECVid videos contain much more action than those used in the 2019 VM task, and thus correspond to more generic use cases.

Each video constitutes a coherent unit in terms of meaning and is associated with two memorability scores, which refer to its probability of being remembered after two different durations of memory retention. A set of pre-extracted features is also distributed:

● image-level features: AlexNetFC7 [9], HOG [5], HSVHist, RGBHist, LBP [6], VGGFC7 [14];
● video-level feature: C3D [16].

The image-level features were extracted from three frames of each video: the first, the middle and the last. In addition, each TRECVid video is accompanied by two textual captions describing the activity it shows. Further information about the annotation process was also provided to allow deeper investigation of user interaction with memorability: the annotations collected from participants include the first and second appearance positions of each target video, together with the user's response time and the key pressed while watching each video.

The TRECVid 2019 Video-to-Text dataset [1] contains 6,000 videos. In 2020, three subsets were distributed as part of the MediaEval Predicting Media Memorability task: a training set of 590 videos, a development set of 410 videos and a test set of 500 videos. Each video was annotated by at least 16 annotators for short-term memorability; the long-term annotations are fewer.

Similar to previous editions of the task [2, 4], memorability was measured using recognition tests, i.e., through an objective measure, a few minutes after memorisation of the videos (short term) and again 24 to 72 hours later (long term). The ground truth dataset was collected using the video memorability game protocol proposed by Cohendet et al. [3]. Two versions of the memorability game were published: one on Amazon Mechanical Turk (AMT) and another issued for general use with three language options: English, Spanish and Turkish.
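As a usage illustration, the sketch below aggregates the pre-extracted image-level features described above into one descriptor per video by averaging the first, middle and last frame vectors. The CSV layout, one row per frame with a video identifier followed by feature values, is an assumption made for illustration; the features actually distributed with the task come in their own formats.

```python
import csv
import numpy as np

def load_video_descriptors(csv_path):
    """Average per-frame feature vectors into one descriptor per video.

    Assumes a CSV with a video identifier in the first column and the
    feature values for one frame in the remaining columns, so that each
    video contributes three rows (first, middle and last frame). This
    layout is hypothetical; adapt the parsing to the actual feature
    files distributed with the task.
    """
    per_video = {}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id = row[0]
            frame_vector = np.asarray(row[1:], dtype=float)
            per_video.setdefault(video_id, []).append(frame_vector)
    # One fixed-length descriptor per video, e.g. as regression input.
    return {vid: np.mean(frames, axis=0) for vid, frames in per_video.items()}
```

Averaging is only one plausible aggregation; concatenating the three frame vectors, or using the video-level C3D feature directly, are equally simple starting points.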
5 SUBMISSION AND EVALUATION
As in previous editions of the task, each team is required to predict both short-term and long-term memorability. In total, 10 runs can be submitted, 5 for each subtask. For the two required runs, all information can be used in the development of the system: the provided features, the ground truth data, the video sample titles, features extracted from the visual content, and even external data. The only exception is that the required short-term and long-term memorability runs must not use each other's score annotations. For the remaining runs, a maximum of 4 per subtask, everything is permitted, including the use of cross-annotations between the subtasks.

The outputs of the prediction models, i.e., the predicted memorability scores for the videos, will be compared with the ground truth memorability scores using classic evaluation metrics (e.g., Spearman's rank correlation).
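For reference, Spearman's rank correlation between predicted and ground truth scores can be computed with SciPy. The scores below are made-up toy values; the only assumption is that the two lists are aligned, i.e., indexed by the same videos in the same order.

```python
from scipy.stats import spearmanr

# Toy, made-up scores for five videos, aligned by position.
ground_truth = [0.92, 0.80, 0.75, 0.88, 0.61]
predicted = [0.90, 0.70, 0.78, 0.85, 0.64]

# Spearman's rho correlates the ranks of the two lists, so it rewards
# predicting the correct ordering of videos rather than exact values.
rho, p_value = spearmanr(ground_truth, predicted)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```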
6 DISCUSSION AND OUTLOOK
In this paper we presented the third edition of the Predicting Media Memorability task at the MediaEval 2020 benchmarking initiative. The task provides a framework for the comparative study of different state-of-the-art machine learning approaches aiming to predict short-term and long-term memorability. A collection of videos is provided, together with memorability annotations and a common evaluation metric. In addition, related information has been provided to help participants develop their approaches. Details of the methods employed by participants and their results can be found in the proceedings of the 2020 MediaEval workshop¹.

ACKNOWLEDGMENTS
This work was part-funded by NIST Award No. 60NANB19D155, by Science Foundation Ireland under grant number SFI/12/RC/2289_P2, and by the project AI4Media, A European Excellence Centre for Media, Society and Democracy, H2020 ICT-48-2020, grant 951911.

¹ See CEUR Workshop Proceedings (CEUR-WS.org).

REFERENCES
[1] George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan Fiscus, Afzal Godil, Andrew Delgado, Jesse Zhang, Eliot Godard, Lukas Diduch, and others. 2019. TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval. (2019).
[2] Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Working Notes Proceedings of the MediaEval 2018 Workshop. Sophia Antipolis, France.
[3] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, and Martin Engilberge. 2019. VideoMem: Constructing, Analyzing, Predicting Short-term and Long-term Video Memorability. In Proceedings of the IEEE International Conference on Computer Vision. 2531–2540.
[4] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. Predicting Media Memorability Task at MediaEval 2019. In Working Notes Proceedings of the MediaEval 2019 Workshop. Sophia Antipolis, France.
[5] Navneet Dalal and Bill Triggs. 2005. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1. IEEE, 886–893.
[6] Dong-Chen He and Li Wang. 1990. Texture Unit, Texture Spectrum, and Texture Analysis. IEEE Transactions on Geoscience and Remote Sensing 28, 4 (1990), 509–512.
[7] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2013. What Makes a Photograph Memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1469–1482.
[8] Aditya Khosla, Akhil S. Raju, Antonio Torralba, and Aude Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In Proceedings of the IEEE International Conference on Computer Vision. 2390–2398.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. 1097–1105.
[10] MediaEval. 2020. MediaEval 2020: Predicting Media Memorability. (2020). https://multimediaeval.github.io/editions/2020/tasks/memorability/ Accessed: 2020-11-26.
[11] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva. 2020. Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability. In Computer Vision – ECCV 2020, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer International Publishing, Cham, 223–240.
[12] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 2730–2739.
[13] Aliaksandr Siarohin, Gloria Zen, Cveta Majtanovic, Xavier Alameda-Pineda, Elisa Ricci, and Nicu Sebe. 2019. Increasing Image Memorability with Neural Style Transfer. ACM Trans. Multimedia Comput. Commun. Appl. 15, 2, Article 42 (June 2019), 22 pages. https://doi.org/10.1145/3311781
[14] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
[15] Hammad Squalli-Houssaini, Ngoc Q. K. Duong, Marquant Gwenaëlle, and Claire-Hélène Demarty. 2018. Deep Learning for Predicting Image Memorability. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2371–2375.
[16] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.