Overview of the MediaEval 2023 Predicting Video Memorability Task

Mihai Gabriel Constantin (1), Claire-Hélène Demarty (2), Camilo Fosco (3), Alba García Seco de Herrera (4), Sebastian Halder (4), Graham Healy (5), Bogdan Ionescu (1), Ana Matran-Fernandez (4), Rukiye Savran Kiziltepe (6), Alan F. Smeaton (5) and Lorin Sweeney (5)

(1) University Politehnica of Bucharest, Romania
(2) InterDigital, France
(3) Massachusetts Institute of Technology, Cambridge, USA
(4) University of Essex, UK
(5) Dublin City University, Ireland
(6) Karadeniz Technical University, Turkey

MediaEval'23: Multimedia Evaluation Workshop, February 1–2, 2024, Amsterdam, The Netherlands and Online
Contact: mihai.constantin84@upb.ro (M. G. Constantin)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper describes the sixth edition of the Predicting Video Memorability task, part of the MediaEval multimedia evaluation benchmark initiative (https://multimediaeval.github.io/). As in previous editions, we use video data and annotations from two datasets, Memento10k and VideoMem. In light of the consistent performance plateau observed in previous iterations of the prediction task, in which participants were required to train and test on the same dataset, we have decided to drop the prediction task from this year's competition. This change allows participants to redirect their efforts toward more challenging tasks. For this edition we therefore propose two tasks: the generalization task, where participants are required to train on one dataset and test their results on a different dataset, and the EEG task, where participants are required to predict memorability from EEG data. In this paper we present the main aspects of the 2023 Predicting Video Memorability task, covering the proposed tasks, the datasets, the evaluation methods and metrics, and the requirements for participants.

1. Introduction

Multimedia processing systems bear the formidable task of accurately predicting and correlating a vast array of media content with the intricacies of human cognition. This role places them at the heart of media retrieval and recommendation systems, where the fusion of computer vision, deep learning and the cognitive sciences is of paramount importance in providing useful and insightful results. In this context, memorability is one of the most important aspects of human cognition explored by researchers from various domains. Defined as the likelihood that a certain piece of multimedia content will be remembered and recognized on subsequent viewing, memorability, and the question "what makes a video memorable?", remains an open research question. The 2023 MediaEval Predicting Video Memorability task attempts to answer some of these questions by proposing a common evaluation benchmark for models that target memorability prediction for videos. This is the sixth edition of the task, building on the success of previous editions and on the patterns and lessons they revealed. The task has continually evolved and adapted throughout these editions, taking into account the general trends in results, observations regarding the data, annotations and ground truth, as well as valuable participant feedback.
2. Related work

Several key works on human perception of multimedia data have shown not only an astonishing capacity for memorization in human viewers [1], but also that people tend to retain very specific characteristics and details of the visual samples they are shown [2]. In this context, numerous works have analyzed memorability from a computer vision standpoint, targeting images [3, 4] and videos [5, 6]. Important developments in this domain also target the use of physiological data such as fMRI [5] and EEG [7]. Furthermore, researchers have studied different sets of low- and high-level human-understandable attributes and their correlation with memorability, including but not limited to the presence of certain objects [8], photographic quality and emotions [9], and natural scene types [10].

The Predicting Video Memorability task builds upon these findings and initial ideas. Over its previous five editions [11] it has featured short- and long-term video memorability tasks and multiple datasets, including Memento10k [12], VideoMem [13], and a memorability-annotated subset of the 2019 TRECVid Video-to-Text dataset [14], each featuring several modalities, including visual, audio, and textual. Multiple facets of memorability prediction have been studied across the editions, manifested as three different subtasks: (i) a prediction subtask, which asks participants to train their models on the training and validation subsets of one dataset and submit their runs for testing on the same dataset; (ii) a generalization subtask, where participants train and validate their models on one dataset and test the generalization properties on the testing subset of another dataset; and (iii) an EEG-based subtask, where participants must use EEG data to infer whether a certain viewer will memorize a given video.

Results thus far show some interesting trends. For the prediction subtask, results appear to have reached a plateau around Spearman's rank correlation values of 0.7, which leads us to theorize that the maximum achievable performance, or values very close to it, has been reached. On the other hand, results for the generalization subtask show lower performance. This leaves ample room for development in this area of memorability research and shows the need for less dataset-specific systems. Finally, the EEG task, while it has had only one full edition in 2022 and a pilot edition in 2021, has shown some promising initial results.

3. Task description

Given the performance plateau registered on the prediction task, and the difficulties participants' systems had with the generalization task, for this edition we propose to drop the prediction subtask, allowing participants to focus on the generalization of memorability predictor systems (Subtask 1). We also continue the EEG-based prediction task (Subtask 2), given its encouraging start in the previous edition of MediaEval.

3.1. Subtask 1: Generalization

Subtask 1 deals with the generalization of memorability predictor systems, testing them in a more challenging scenario, but one that is closer to real-world applications. Participants are asked to train and validate their systems on the training and development splits of the Memento10k dataset, and to submit their predictions on the test split of VideoMem. Participants are allowed a maximum of 5 runs for this task. One of the runs must consist of systems trained only with Memento10k data, while the other four can augment the training data in any way the participants feel is necessary, as long as they do not use VideoMem data. A minimal baseline for this setup is sketched below.
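As an illustration only, the following is a minimal sketch of a Subtask 1 baseline: a regressor trained on one of the provided pre-extracted features for Memento10k and applied to the VideoMem test split. The file names, column names, and feature layout are hypothetical and do not correspond to an official release format.

```python
# Hypothetical layout: one CSV row per video, "feat_*" feature columns,
# plus a ground-truth score column for the Memento10k splits.
import pandas as pd
from scipy.stats import spearmanr
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

train = pd.read_csv("memento10k_train_resnet50.csv")  # Memento10k training split
dev = pd.read_csv("memento10k_dev_resnet50.csv")      # Memento10k development split
test = pd.read_csv("videomem_test_resnet50.csv")      # VideoMem test split (no labels)

feature_cols = [c for c in train.columns if c.startswith("feat_")]

# Support vector regression on top of standardized ResNet50 features.
model = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.01))
model.fit(train[feature_cols], train["short_term_score"])

# Sanity check on the Memento10k development split before submitting.
dev_pred = model.predict(dev[feature_cols])
print("dev Spearman:", spearmanr(dev_pred, dev["short_term_score"]).correlation)

# Predictions on the VideoMem test split, written out as one run file.
test_pred = model.predict(test[feature_cols])
pd.DataFrame({"video_id": test["video_id"], "prediction": test_pred}).to_csv(
    "run1_memento_only.csv", index=False)
```

The Memento10k-only run above would correspond to the mandatory first run; the remaining runs could extend the training data with any external, non-VideoMem material.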
3.2. Subtask 2: EEG-based prediction

Participants must create systems that automatically predict whether a given human subject will remember a certain video on subsequent viewing, starting from the provided EEG data. For each video, in addition to the EEG features, we also provide the identifier of the volunteer, the label, and the id of the video being watched, so that the video features available for the other subtask can also be used. Participants are, however, required to include EEG data in every system they develop for this task.

4. Datasets

This edition of the memorability task uses three datasets across its two subtasks. In the generalization subtask, the Memento10k dataset is provided and used for system training and validation, while the VideoMem dataset is used for system testing. The EEG subtask uses human physiological data from the EEGMem dataset, which consists of human subjects' EEG responses recorded while watching videos from the Memento10k dataset. This section presents these datasets, the data they encompass, their annotation protocols, and the features we provide with each dataset.

The following pre-extracted features are provided along with the Memento10k and VideoMem datasets: (i) image-level features: AlexNetFC7 [15], HOG [16], HSVHist, RGBHist, LBP [17], VGGFC7 [18], DenseNet121 [19], ResNet50 [20], EfficientNetB3 [21]; and (ii) video-level features: C3D [22]. Given the different nature and modality of the EEG data, a different set of features is computed and provided for it: ERPs (i.e., EEG amplitudes at the start of the video), ERSPs (features in the time-frequency domain, spanning the whole duration of the video), and images (also conveying time-frequency information, but suitable for feeding into a CNN or another computer vision system). A simple classifier built on top of these EEG features is sketched below.
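Purely as an illustration, the following sketch shows how the ERSP features could feed a baseline classifier for Subtask 2, scored with ROC AUC. The array shapes and file names are assumptions, not the official data format.

```python
# Hypothetical inputs: an ERSP tensor of shape
# (n_trials, n_channels, n_freqs, n_times) and binary recognition labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ersp = np.load("eegmem_ersp.npy")      # time-frequency EEG features per trial
labels = np.load("eegmem_labels.npy")  # 1 = video recognised, 0 = not recognised

# Flatten each trial into a single feature vector.
X = ersp.reshape(len(ersp), -1)
y = labels

# Strongly regularized logistic regression on standardized features.
clf = make_pipeline(StandardScaler(), LogisticRegression(C=0.01, max_iter=1000))

# 5-fold cross-validated ROC AUC, the official metric for Subtask 2.
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("mean ROC AUC:", auc.mean())
```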
4.1. Memento10k

The Memento10k dataset is an extensive and comprehensive dataset for investigating and analysing video memorability. It consists of a collection of 10,000 three-second real-world video clips sourced from the Internet. Each video is accompanied by corresponding short-term memorability scores, memorability decay values, action labels, and five human-annotated captions. The dataset covers the concept of memorability over a range of presentation delays, from seconds to minutes, providing valuable insights into the temporal dynamics of memorability and how it changes over time. The short-term memorability scores are derived from the "Memento: The Video Memory Game" experimental approach [9], in which crowdworkers are tasked with identifying repeated videos, and are based on their responses. Each video clip has received 90 annotations on average, and the dataset shows a high level of human consistency, as indicated by a Spearman's rank correlation coefficient of 0.73.

From the Memento10k dataset [12] we will provide the training (7,000 video samples) and validation (1,500 video samples) sets, which will be used as the official training and validation (or development) sets of the MediaEval 2023 Predicting Video Memorability task.

4.2. VideoMem

VideoMem is a large-scale dataset composed of 10,000 soundless seven-second videos created to predict short-term and long-term video memorability. The video clips were obtained from a collection of cinematic raw stock footage, covering scenes of animals, food, nature, people, and transportation. Every video is accompanied by a caption (its original title) together with short-term and long-term memorability scores. The dataset aims to facilitate research on understanding the memorability of videos and on assessing methodologies for predicting multimedia content memorability. A novel annotation protocol was introduced, with short-term and long-term memorability measured via recognition tests conducted shortly after viewing the videos and 24–72 hours later, respectively [13]. From the VideoMem dataset, the test set (2,000 video samples) will be provided and used as the official test set for the competition.

4.3. EEGMem

The EEGMem dataset [7] is composed of EEG data collected from 12 subjects while they watched a subset of the Memento10k dataset [12]. The subjects were then asked to watch the same videos through a custom-built online portal 24–72 hours after the video-EEG recording session, indicating whether they recognised each video.

5. Evaluation

Each subtask has its own set of evaluation metrics. Subtask 1 (generalization) will use three metrics: Spearman's rank correlation, Pearson correlation, and mean squared error. As in previous editions of the memorability task, Spearman's rank correlation will be the official metric for Subtask 1. Subtask 2 (EEG) will use the Area Under the Receiver Operating Characteristic Curve (ROC AUC) as its official metric. A sketch of how these metrics can be computed is given below.
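For reference, a minimal sketch of how these metrics can be computed with SciPy and scikit-learn follows; the prediction and ground-truth values are illustrative placeholders, and this is not the official scoring script.

```python
# Illustrative values only; in practice these arrays would hold the
# submitted predictions and the withheld ground truth.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_squared_error, roc_auc_score

# Subtask 1: continuous memorability scores.
y_true = np.array([0.82, 0.74, 0.91, 0.66])
y_pred = np.array([0.79, 0.70, 0.88, 0.73])
print("Spearman:", spearmanr(y_pred, y_true).correlation)  # official metric
print("Pearson:", pearsonr(y_pred, y_true)[0])
print("MSE:", mean_squared_error(y_true, y_pred))

# Subtask 2: binary recognition labels scored against continuous outputs.
eeg_true = np.array([1, 0, 1, 1, 0])
eeg_scores = np.array([0.9, 0.35, 0.6, 0.8, 0.4])
print("ROC AUC:", roc_auc_score(eeg_true, eeg_scores))
```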
6. Conclusions

This paper presents the sixth edition of the MediaEval Predicting Video Memorability task. This year's edition proposes two subtasks, one based on the generalization of memorability prediction systems, and another based on EEG data recorded from human subjects.

Acknowledgements

Financial support was provided under project AI4Media, A European Excellence Centre for Media, Society and Democracy, H2020 ICT-48-2020, grant #951911.

References

[1] R. N. Shepard, Recognition memory for words, sentences, and pictures, Journal of Verbal Learning and Verbal Behavior 6 (1967) 156–163.
[2] T. F. Brady, T. Konkle, G. A. Alvarez, A. Oliva, Visual long-term memory has a massive storage capacity for object details, Proceedings of the National Academy of Sciences 105 (2008) 14325–14329.
[3] Y. Baveye, R. Cohendet, M. Perreira Da Silva, P. Le Callet, Deep learning for image memorability prediction: The emotional bias, in: Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 491–495.
[4] J. Fajtl, V. Argyriou, D. Monekosso, P. Remagnino, AMNet: Memorability estimation with attention, arXiv preprint arXiv:1804.03115 (2018).
[5] J. Han, C. Chen, L. Shao, X. Hu, J. Han, T. Liu, Learning computational models of video memorability from fMRI brain imaging, IEEE Transactions on Cybernetics 45 (2014) 1692–1703.
[6] S. Shekhar, D. Singal, H. Singh, M. Kedia, A. Shetty, Show and recall: Learning what makes videos memorable, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2730–2739.
[7] L. Sweeney, A. Matran-Fernandez, S. Halder, A. G. S. de Herrera, A. Smeaton, G. Healy, Overview of the EEG pilot subtask at MediaEval 2021: Predicting media memorability, arXiv preprint arXiv:2201.00620 (2022).
[8] M. A. Kramer, M. N. Hebart, C. I. Baker, W. A. Bainbridge, The features underlying the memorability of objects, Science Advances 9 (2023) eadd2981.
[9] P. Isola, D. Parikh, A. Torralba, A. Oliva, Understanding the intrinsic memorability of images, Advances in Neural Information Processing Systems 24 (2011).
[10] J. Lu, M. Xu, R. Yang, Z. Wang, Understanding and predicting the memorability of outdoor natural scenes, IEEE Transactions on Image Processing 29 (2020) 4927–4941.
[11] R. Savran Kiziltepe, M. G. Constantin, C.-H. Demarty, G. Healy, C. Fosco, A. Garcia Seco De Herrera, S. Halder, B. Ionescu, A. Matran-Fernandez, A. F. Smeaton, et al., Overview of the MediaEval 2021 predicting media memorability task, in: MediaEval Workshop 2021, CEUR Workshop Proceedings, volume 3181, 2021.
[12] A. Newman, C. Fosco, V. Casser, A. Lee, B. McNamara, A. Oliva, Multimodal memorability: Modeling effects of semantics and decay on video memorability, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI, Springer, 2020, pp. 223–240.
[13] R. Cohendet, C.-H. Demarty, N. Q. Duong, M. Engilberge, VideoMem: Constructing, analyzing, predicting short-term and long-term video memorability, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2531–2540.
[14] G. Awad, A. A. Butt, K. Curtis, Y. Lee, J. Fiscus, A. Godil, A. Delgado, J. Zhang, E. Godard, L. Diduch, et al., TRECVID 2019: An evaluation campaign to benchmark video activity detection, video captioning and matching, and video search & retrieval, arXiv preprint arXiv:2009.09984 (2020).
[15] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60 (2017) 84–90.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, IEEE, 2005, pp. 886–893.
[17] D.-C. He, L. Wang, Texture unit, texture spectrum, and texture analysis, IEEE Transactions on Geoscience and Remote Sensing 28 (1990) 509–512.
[18] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[19] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 6105–6114.
[22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.